Introduction

The aim of this study is to develop forecasting approaches and to manipulate the data provided by Trendyol so that the necessary interpretations can be made. During the study, the seasonality of each product will be examined. Then the necessary decomposition will be done so that ARIMA models can be used. Further, possible regressors will be checked and ARIMAX and SARIMAX models will be utilized. The best model will be chosen according to performance and test results, and the necessary interpretations will be made at each step.

Product names used in the work:

1) Yüz Temizleyici

2) Bebek Islak Mendili

3) Xiaomi Bluetooth Kulaklık

4) Fakir Süpürge

5) Tayt

6) Diş Fırçası

7) Mont

8) Bikini Model-1

9) Bikini Model-2

Step 0

In this step, the required libraries are loaded, the necessary data tables are created and other preliminary operations are executed.

library(data.table)
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
## 
##     between, first, last
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
require(mgcv)
## Loading required package: mgcv
## Loading required package: nlme
## 
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
## 
##     collapse
## This is mgcv 1.8-28. For overview type 'help("mgcv-package")'.
library(fpp)
## Loading required package: forecast
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
## 
## Attaching package: 'forecast'
## The following object is masked from 'package:nlme':
## 
##     getResponse
## Loading required package: fma
## Loading required package: expsmooth
## Loading required package: lmtest
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: tseries
require(gratia)
## Loading required package: gratia
library(readxl)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:data.table':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday, week,
##     yday, year
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(forecast)
library(zoo)
library(stringr)
library(stats)
library(urca)

g13<-read_excel("g13.xlsx") 
dat<-as.data.table(g13)   # Main Data Table

yuzt<-dat[1:372,]        # Data table which holds the data of Product-1 (Yüz Temizleyici)
mendil<-dat[373:744,]    # Data table which holds the data of Product-2 (Bebek Islak Mendili)
xiaomi<-dat[745:1116,]   # Data table which holds the data of Product-3 (Xiaomi Bluetooth Kulaklık)
fakir<-dat[1117:1488,]   # Data table which holds the data of Product-4 (Fakir Süpürge)
tayt<-dat[1489:1860,]    # Data table which holds the data of Product-5 (Tayt)
bik1<-dat[1861:2232,]    # Data table which holds the data of Product-6 (Bikini Model-1)
firca<-dat[2233:2604,]   # Data table which holds the data of Product-7 (Diş Fırçası)
mont<-dat[2605:2976,]    # Data table which holds the data of Product-8 (Mont)
bik2<-dat[2977:3348,]    # Data table which holds the data of Product-9 (Bikini Model-2)

Product 1 (Yüz Temizleyici)

Step 1

In this step, the seasonality of the product is analysed to make an appropriate decomposition.

Seasonality Analysis

The plot below shows the sold count of product-1 for the related period.

As seen above, there is a pattern which fluctuates up on certain days and down on others. This directs us to check the autocorrelation so that the related seasonality can be determined.

The table below shows the autocorrelation values at each lag up to lag 100.

It would not be inappropriate to say that the autocorrelation function shows a pattern as well. Around lags 15 and 60 the autocorrelation values get higher, and the autocorrelation fades away at later lags. Normally, the time series object of sold count for product-1 ought to be given a frequency of 15, which would cover the lag-60 autocorrelation too. However, as will be observed in the next step, if frequency 7 is chosen the decomposition results are better than with frequency 15: more of the other autocorrelation lags are covered and the data become more stationary. So it would not be wrong to say that there is weekly seasonality, even if the autocorrelation function does not show it directly.
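The lag reading above can also be done numerically instead of off the plot. A minimal sketch, with a synthetic weekly series standing in for the real sold_count column (the raw data is not reproduced here): the strongest absolute autocorrelations should cluster at multiples of the seasonal period.

```r
# Synthetic stand-in for yuzt$sold_count: a weekly cycle plus noise.
set.seed(13)
x <- 50 + 10 * sin(2 * pi * (1:372) / 7) + rnorm(372, sd = 2)

rho <- acf(x, lag.max = 100, plot = FALSE)$acf[-1]  # drop lag 0
head(order(abs(rho), decreasing = TRUE))            # strongest lags cluster at multiples of 7
```

On the real series the analogous call is acf(yuzt$sold_count, lag.max = 100, plot = FALSE).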

Decomposition

As seen below, the variance increases over time, so the best way of decomposing is to use the multiplicative method.

Frequency=7

The required code is below:

ts11=ts(yuzt$sold_count, frequency = 7)
decomposed11= decompose(ts11,type="multiplicative")
deseasonalised11=ts11/decomposed11$seasonal
detrend11=deseasonalised11/decomposed11$trend
acf(detrend11,na.action = na.pass, main="Decomposed ACF freq=7")

The ACF of the decomposed series seems really low and acceptable. Moreover, as seen in the graph, the random (detrended) values are quite stationary, with stable variance and a nearly constant mean. The trend seems to increase with fluctuations, which may cause some forecast error, but the model is still acceptable and good. Lastly, the seasonal component shows a decent pattern with an acceptable shape.

Frequency=15

The required code is below:

ts12=ts(yuzt$sold_count, frequency = 15)
decomposed12= decompose(ts12,type="multiplicative")
deseasonalised12=ts12/decomposed12$seasonal
detrend12=deseasonalised12/decomposed12$trend
acf(detrend12,na.action = na.pass, main="Decomposed ACF freq=15")

The reason why the ACF is really high at lag 15 might be that in Turkey most people get their salary on either day 1 or day 15 of the month, which justifies the high ACF value around lag 15. (Source) People probably tend to shop more after getting their salary; even if they generally use credit cards, receiving fresh money gives them the confidence to spend more on shopping. So another seasonality is observed roughly every 15 days. However, as mentioned above, the ACF of the detrended series is worse now. Moreover, the graph below shows the plot of the detrended values, which is worse than the one for frequency 7.

The detrended (random) values are less stationary in this case: they do not have constant variance and their mean is less constant than before. The trend in this series seems wavier but tends to increase more. Lastly, the seasonal component seems similar. Overall, this model is worse than the previous one because the detrended random values show more nonstationarity.

Frequency=30

The required code is below:

ts13=ts(yuzt$sold_count, frequency = 30)
decomposed13= decompose(ts13,type="multiplicative")
deseasonalised13=ts13/decomposed13$seasonal
detrend13=deseasonalised13/decomposed13$trend
acf(detrend13,na.action = na.pass, main="Decomposed ACF freq=30")

The detrended values are worst in this model. They show nonstationary behavior due to changing variance and mean. The trend component seems to increase more than in the other two and fluctuates less. The ACF of the detrended values is high, which is not what we want, and the random component is the worst of the three models. The seasonal component has a larger range, which affects the results more.

For the further parts, we will continue with the model of frequency = 7.
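The visual comparison of the three frequencies can be backed by a simple numeric check, which is not part of the original analysis: for a multiplicative decomposition, the frequency whose random component varies least leaves the least structure unexplained. A sketch on a synthetic series with a known weekly pattern:

```r
# Synthetic series: trend * weekly seasonal factor * multiplicative noise.
set.seed(13)
y <- (100 + 1:210) * rep(c(1.2, 1.1, 1.0, 0.9, 0.8, 0.9, 1.1), 30) *
  exp(rnorm(210, sd = 0.02))

random_sd <- sapply(c(7, 15, 30), function(f) {
  d <- decompose(ts(y, frequency = f), type = "multiplicative")
  sd(d$random, na.rm = TRUE)   # leftover variation after removing trend and seasonal
})
names(random_sd) <- c("f7", "f15", "f30")
random_sd                      # the true frequency (7) gives the smallest value
```

The same sapply over the real ts objects would quantify the choice made above.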

Step 2

Based on the findings in the first part, ARIMA models will be proposed and their performance will be discussed.

As understood from the ACF and PACF plots, both tend to decrease. Reading the decaying ACF together with a PACF that shows no significant spikes after lag 3, an AR(3) model can be proposed. Reading the decaying PACF together with an ACF that shows no significant spikes after lag 3, an MA(3) model can be proposed. We can also try auto.arima and compare these three to select the best model. Additionally, it should be mentioned that there is no sign of seasonal ARIMA here.
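This cut-off reading can be illustrated on simulated data. A sketch (the AR coefficients are chosen to roughly match the AR(3) fit below, not taken from the real series): for an AR(3) process the PACF is significant up to lag 3 and then drops inside the ±2/√n band, while the ACF decays gradually.

```r
# Simulate an AR(3) process and inspect its partial autocorrelations.
set.seed(13)
sim <- arima.sim(model = list(ar = c(0.19, -0.12, -0.31)), n = 500)

p <- as.numeric(pacf(sim, lag.max = 10, plot = FALSE)$acf)
round(p, 2)   # lag 3 stands out; lags 4..10 stay near the +/- 2/sqrt(500) band
```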

AR(3)

The AR(3) model is created below:

model11=arima(decomposed11$random,order=c(3,0,0))
print(model11)
## 
## Call:
## arima(x = decomposed11$random, order = c(3, 0, 0))
## 
## Coefficients:
##          ar1      ar2      ar3  intercept
##       0.1893  -0.1191  -0.3100     0.9748
## s.e.  0.0496   0.0503   0.0497     0.0127
## 
## sigma^2 estimated as 0.09049:  log likelihood = -79.89,  aic = 169.78
fitted11=fitted(model11)
fitted_transformed11=fitted11*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed11,type = "l",col="blue")

The fitted values (blue) are really close to the real sales amounts (black). They are slightly lagged because of the nature of AR models.

The residuals show a constant mean, but the variance differs at times. The ACF and PACF plots still show some signs of autocorrelation.

MA(3)

The MA(3) model is created below:

model12=arima(decomposed11$random,order=c(0,0,3))
print(model12)
## 
## Call:
## arima(x = decomposed11$random, order = c(0, 0, 3))
## 
## Coefficients:
##          ma1      ma2      ma3  intercept
##       0.0885  -0.2125  -0.3677     0.9748
## s.e.  0.0472   0.0503   0.0440     0.0080
## 
## sigma^2 estimated as 0.08918:  log likelihood = -77.27,  aic = 164.54
fitted12=fitted(model12)
fitted_transformed12=fitted12*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed12,type = "l",col="blue")

The fitted values (blue) are really close to the real sales amounts (black) in this model too.

The residuals show a constant mean, but the variance seems to differ more in this model. Even though the ACF and PACF values are better here, they are still high.

Auto.Arima

The auto.arima model is created below:

model13=auto.arima(decomposed11$random,seasonal=FALSE)
print(model13)
## Series: decomposed11$random 
## ARIMA(3,0,2) with non-zero mean 
## 
## Coefficients:
##          ar1     ar2      ar3      ma1      ma2    mean
##       0.2382  0.2147  -0.3872  -0.1166  -0.4240  0.9748
## s.e.  0.2073  0.1875   0.0779   0.2199   0.1803  0.0076
## 
## sigma^2 estimated as 0.08708:  log likelihood=-69.97
## AIC=153.93   AICc=154.25   BIC=181.25
model13=arima(decomposed11$random,order=c(3,0,2)) # Manual version of auto.arima (same model)
fitted13=fitted(model13)
fitted_transformed13=fitted13*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed13,type = "l",col="blue")

The fitted values (blue) are again really close to the real sales amounts (black).

plot(residuals(model13), xlab="Days", ylab="Values", main="Residuals")

tsdisplay(fitted13,main="ARIMA(3,0,2)")

The residuals show a constant mean, but the variance differs at times. The ACF and PACF plots tell us this is the best model in terms of lower ACF and PACF values.
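The visual residual check can be complemented by a formal Ljung-Box test, which the original report does not run. A sketch on a simulated AR(3) series refit with the same order (a stand-in for decomposed11$random):

```r
# Simulate an AR(3) process, refit it, and test the residuals for
# leftover autocorrelation with a Ljung-Box test.
set.seed(13)
z <- arima.sim(model = list(ar = c(0.19, -0.12, -0.31)), n = 372)
fit <- arima(z, order = c(3, 0, 0))

# fitdf accounts for the 3 estimated AR coefficients
lb <- Box.test(residuals(fit), lag = 10, type = "Ljung-Box", fitdf = 3)
lb$p.value   # a large p-value means no significant remaining autocorrelation
```

On the real fit, Box.test(residuals(model13), lag = 10, type = "Ljung-Box", fitdf = 5) would test the ARIMA(3,0,2) residuals the same way.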

Model Comparison

AIC(model11)
## [1] 169.7801
BIC(model11)
## [1] 189.2933
AIC(model12)
## [1] 164.5391
BIC(model12)
## [1] 184.0523
AIC(model13)
## [1] 153.9338
BIC(model13)
## [1] 181.2523

As seen above, model13, the auto.arima result ARIMA(3,0,2), is the best model, with the lowest values. ARIMA(3,0,2) should be selected for the further steps.

Test for fitted values:

accu=function(actual,forecast){
  n=length(actual)
  error=actual-forecast
  mean=mean(actual)
  sd=sd(actual)
  CV=sd/mean
  MAPE=sum(abs(error/actual))/n
  MAD=sum(abs(error))/n
  MADP=sum(abs(error))/sum(abs(actual))
  WMAPE=MAD/mean
  l=data.frame(n,mean,sd,CV,MAPE,MAD,MADP,WMAPE)
  return(l)
}
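A quick sanity check of the accu function above on a toy input (values chosen by hand, not from the data): with actuals of 100 and 200 and forecasts 10% above them, every percentage-type metric should come out at exactly 0.1.

```r
# Toy check of accu(): forecasts 10% above the actuals.
res <- accu(actual = c(100, 200), forecast = c(110, 220))
res[c("MAPE", "MAD", "MADP", "WMAPE")]   # 0.1, 15, 0.1, 0.1
```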

accu(ts11,tail(head(fitted_transformed11,369),366))
##     n     mean       sd        CV      MAPE      MAD      MADP     WMAPE
## 1 372 74.19624 63.20166 0.8518175 0.2702096 17.51891 0.2361159 0.2361159
accu(ts11,tail(head(fitted_transformed12,369),366))
##     n     mean       sd        CV      MAPE      MAD      MADP     WMAPE
## 1 372 74.19624 63.20166 0.8518175 0.2703676 17.21833 0.2320647 0.2320647
accu(ts11,tail(head(fitted_transformed13,369),366))
##     n     mean       sd        CV      MAPE      MAD      MADP     WMAPE
## 1 372 74.19624 63.20166 0.8518175 0.2619928 16.86907 0.2273575 0.2273575

Again, the third result, ARIMA(3,0,2), gives the best performance, with lower errors on all error metrics.

Step 3

In this step, possible regressors will be examined to be used on arimax model.

Code below shows correlation between sold count and the data categories.

correlation=function(dt){  # Correlation between sold count (column 4) and the other numeric columns (5..12)
  dt=as.data.frame(dt)
  vars=c("visit count","basket count","favored count","category sold",
         "category visit","category basket","category favored","category brand sold")
  a=data.frame(Variable=vars,Correlation=NA_real_)
  for(i in 1:8) {
    a[i,2]=cor(dt[,4],dt[,i+4])
  }
  return (a)
}
cor(yuzt$sold_count,yuzt$price)
## [1] -0.2299915
correlation(yuzt)
##              Variable Correlation
## 1         visit count   0.3144539
## 2        basket count   0.8194234
## 3       favored count   0.4509207
## 4       category sold   0.6230038
## 5      category visit   0.1228725
## 6     category basket   0.2880863
## 7    category favored   0.6858087
## 8 category brand sold   0.3925289

Obviously there is high correlation between the sold count and the basket count, category sold and category favored columns. This result makes sense because people tend to buy what they have favored or put in their basket. Thanks to the function created above, possible regressors are selected easily. However, a potential problem with these regressors is that their future values are not available, so they would have to be predicted as well in the final version of the model.
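The future-regressor problem mentioned above can be handled in two stages: forecast the regressor with its own model, then feed that forecast into predict() as newxreg. A sketch on synthetic data (the names basket and sold are stand-ins for the real columns, not part of the original pipeline):

```r
library(forecast)

# Synthetic regressor and a target that depends on it.
set.seed(13)
basket <- 50 + cumsum(rnorm(200))
sold   <- 0.8 * basket + arima.sim(model = list(ar = 0.5), n = 200)

fit_y <- arima(sold, order = c(1, 0, 0), xreg = basket)

basket_future <- forecast(auto.arima(basket), h = 7)$mean      # step 1: forecast the regressor
pred <- predict(fit_y, n.ahead = 7, newxreg = basket_future)   # step 2: forecast the target
pred$pred
```

In the real model, the same two-step pattern would supply future basket_count values to the ARIMAX fit below.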

Step 4

In this step, the potential regressors will be used in the model and ARIMAX models will be created.

modelx1=arima(decomposed11$random,order=c(3,0,2),xreg=yuzt$basket_count)
print(modelx1)
## 
## Call:
## arima(x = decomposed11$random, order = c(3, 0, 2), xreg = yuzt$basket_count)
## 
## Coefficients:
##          ar1     ar2      ar3    ma1      ma2  intercept  yuzt$basket_count
##       0.1656  0.1448  -0.3539  2e-04  -0.2923     0.8806              3e-04
## s.e.  0.1401  0.1554   0.0039    NaN      NaN        NaN                NaN
## 
## sigma^2 estimated as 0.07997:  log likelihood = -57.27,  aic = 130.53
fittedx1=fitted(modelx1)
fitted_transformedx1=fittedx1*decomposed11$seasonal*decomposed11$trend

modelx2=arima(decomposed11$random,order=c(3,0,2),xreg=yuzt$category_sold)
print(modelx2)
## 
## Call:
## arima(x = decomposed11$random, order = c(3, 0, 2), xreg = yuzt$category_sold)
## 
## Coefficients:
##           ar1      ar2      ar3     ma1     ma2  intercept  yuzt$category_sold
##       -0.0960  -0.0457  -0.2361  0.3688  0.0609     0.8011               4e-04
## s.e.   0.1352   0.1309      NaN     NaN     NaN        NaN                 NaN
## 
## sigma^2 estimated as 0.07733:  log likelihood = -51.05,  aic = 118.1
fittedx2=fitted(modelx2)
fitted_transformedx2=fittedx2*decomposed11$seasonal*decomposed11$trend

modelx3=arima(decomposed11$random,order=c(3,0,2),xreg=yuzt$category_favored)
print(modelx3)
## 
## Call:
## arima(x = decomposed11$random, order = c(3, 0, 2), xreg = yuzt$category_favored)
## 
## Coefficients:
##          ar1     ar2      ar3      ma1      ma2  intercept
##       0.1691  0.1876  -0.3685  -0.0422  -0.3731     0.8898
## s.e.  0.1974  0.1762   0.0707   0.2061   0.1654     0.0137
##       yuzt$category_favored
##                           0
## s.e.                    NaN
## 
## sigma^2 estimated as 0.08231:  log likelihood = -62.61,  aic = 141.22
fittedx3=fitted(modelx3)
fitted_transformedx3=fittedx3*decomposed11$seasonal*decomposed11$trend

Three models are constructed, one for each of the three regressors. Now they will be tested on the last available one-week window (the final few fitted values are missing because the centred moving-average trend is undefined at the series ends, hence the head/tail trimming).

accu(ts11,tail(head(fitted_transformed13,369),7)) # arima model
##     n     mean       sd        CV        MAPE       MAD        MADP       WMAPE
## 1 372 74.19624 63.20166 0.8518175 0.004031973 0.4117975 0.005550114 0.005550114
accu(ts11,tail(head(fitted_transformedx1,369),7)) # arimax1
##     n     mean       sd        CV        MAPE       MAD        MADP       WMAPE
## 1 372 74.19624 63.20166 0.8518175 0.004048837 0.3541774 0.004773523 0.004773523
accu(ts11,tail(head(fitted_transformedx2,369),7)) # arimax2
##     n     mean       sd        CV        MAPE       MAD        MADP       WMAPE
## 1 372 74.19624 63.20166 0.8518175 0.004150725 0.3935481 0.005304152 0.005304152
accu(ts11,tail(head(fitted_transformedx3,369),7)) # arimax3
##     n     mean       sd        CV        MAPE       MAD        MADP       WMAPE
## 1 372 74.19624 63.20166 0.8518175 0.004010801 0.4028655 0.005429729 0.005429729

As seen in the results, all models perform similarly, but arimax1 is slightly better in terms of lower WMAPE and the other metrics. Moreover, the model with arimax1 (basket count as regressor) performs better than the plain ARIMA model from the previous step.

Final Step

After all the tests and operations done above, the arimax1 model gives the best results and is selected as the final model.

Product 2 (Bebek Islak Mendili)

Step 1

In this step, the seasonality of the product is analysed to make an appropriate decomposition.

Seasonality Analysis

The plot below shows the sold count of product-2 for the related period.

As seen above, there is a pattern which fluctuates up on certain days and down on others. This directs us to check the autocorrelation so that the related seasonality can be determined.

The table below shows the autocorrelation values at each lag up to lag 100.

It would not be inappropriate to say that the autocorrelation function shows a pattern as well. Around lags 15 and 60 the autocorrelation values get higher, and the autocorrelation fades away at later lags. Normally, the time series object of sold count for product-2 ought to be given a frequency of 15, which would cover the lag-60 autocorrelation too. However, just like the previous case in product-1, if frequency 7 is chosen the decomposition results are better than with frequency 15: more of the other autocorrelation lags are covered and the data become more stationary. So it would not be wrong to say that there is weekly seasonality, even if the autocorrelation function does not show it directly.

Decomposition

As seen below, the variance increases over time, so the best way of decomposing is to use the multiplicative method.

Frequency=7

The required code is below:

ts11=ts(mendil$sold_count, frequency = 7)
decomposed11= decompose(ts11,type="multiplicative")
deseasonalised11=ts11/decomposed11$seasonal
detrend11=deseasonalised11/decomposed11$trend
acf(detrend11,na.action = na.pass, main="Decomposed ACF freq=7")

The ACF of the decomposed series seems really low and acceptable. Moreover, as seen in the graph, the random (detrended) values are quite stationary, with stable variance and a nearly constant mean. The trend seems to increase with fluctuations, which may cause some forecast error, but the model is still acceptable and good. Lastly, the seasonal component shows a decent pattern with an acceptable shape.

Frequency=15

The required code is below:

ts12=ts(mendil$sold_count, frequency = 15)
decomposed12= decompose(ts12,type="multiplicative")
deseasonalised12=ts12/decomposed12$seasonal
detrend12=deseasonalised12/decomposed12$trend
acf(detrend12,na.action = na.pass, main="Decomposed ACF freq=15")

The detrended (random) values are less stationary in this case: they do not have constant variance and their mean is less constant than before. The trend in this series seems wavier but tends to increase more. Lastly, the seasonal component seems similar. Overall, this model is worse than the previous one because the detrended random values show more nonstationarity.

Frequency=30

The required code is below:

ts13=ts(mendil$sold_count, frequency = 30)
decomposed13= decompose(ts13,type="multiplicative")
deseasonalised13=ts13/decomposed13$seasonal
detrend13=deseasonalised13/decomposed13$trend
acf(detrend13,na.action = na.pass, main="Decomposed ACF freq=30")

The detrended values are worst in this model. They show nonstationary behavior due to changing variance and mean. The trend component seems to increase more than in the other two and fluctuates less. The ACF of the detrended values is high, which is not what we want, and the random component is the worst of the three models. The seasonal component has a larger range, which affects the results more.

For the further parts, we will continue with the model of frequency = 7.

Step 2

Based on the findings in the first part, ARIMA models will be proposed and their performance will be discussed.

As understood from the ACF and PACF plots, both tend to decrease. Reading the decaying ACF together with a PACF that shows no significant spikes after lag 3, an AR(3) model can be proposed. Reading the decaying PACF together with an ACF that shows no significant spikes after lag 3, an MA(3) model can be proposed. We can also try auto.arima and compare these three to select the best model. Additionally, it should be mentioned that there is no sign of seasonal ARIMA here.

AR(3)

The AR(3) model is created below:

model11=arima(decomposed11$random,order=c(3,0,0))
print(model11)
## 
## Call:
## arima(x = decomposed11$random, order = c(3, 0, 0))
## 
## Coefficients:
##          ar1      ar2      ar3  intercept
##       0.3581  -0.2436  -0.2027     0.9602
## s.e.  0.0511   0.0528   0.0510     0.0150
## 
## sigma^2 estimated as 0.09778:  log likelihood = -94.08,  aic = 198.16
fitted11=fitted(model11)
fitted_transformed11=fitted11*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed11,type = "l",col="blue")

The fitted values (blue) are really close to the real sales amounts (black). They are slightly lagged because of the AR terms.

The residuals show a constant mean, but the variance differs at times. The ACF and PACF plots still show some signs of autocorrelation.

MA(3)

The MA(3) model is created below:

model12=arima(decomposed11$random,order=c(0,0,3))
print(model12)
## 
## Call:
## arima(x = decomposed11$random, order = c(0, 0, 3))
## 
## Coefficients:
##          ma1      ma2      ma3  intercept
##       0.2405  -0.2537  -0.3837     0.9598
## s.e.  0.0483   0.0581   0.0469     0.0099
## 
## sigma^2 estimated as 0.09676:  log likelihood = -92.23,  aic = 194.46
fitted12=fitted(model12)
fitted_transformed12=fitted12*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed12,type = "l",col="blue")

The fitted values (blue) are really close to the real sales amounts (black) in this model too.

The residuals show a nearly constant mean, but the variance seems to jump in this model. Even though the ACF and PACF values are better here, they are still high.

Auto.Arima

The auto.arima model is created below:

model13=auto.arima(decomposed11$random,seasonal=FALSE)
print(model13)
## Series: decomposed11$random 
## ARIMA(2,0,1) with non-zero mean 
## 
## Coefficients:
##          ar1      ar2      ma1    mean
##       1.0134  -0.5088  -0.7209  0.9597
## s.e.  0.0616   0.0450   0.0593  0.0091
## 
## sigma^2 estimated as 0.09503:  log likelihood=-86.95
## AIC=183.9   AICc=184.07   BIC=203.41
model13=arima(decomposed11$random,order=c(2,0,1)) # Manual version of auto.arima (same model)
fitted13=fitted(model13)
fitted_transformed13=fitted13*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed13,type = "l",col="blue")

The fitted values (blue) are again really close to the real sales amounts (black).

plot(residuals(model13), xlab="Days", ylab="Values", main="Residuals")

tsdisplay(fitted13,main="ARIMA(2,0,1)")

The residuals show a constant mean, but the variance differs at times. The ACF and PACF plots tell us this is the best model in terms of lower ACF and PACF values.

Model Comparison

AIC(model11)
## [1] 198.1614
BIC(model11)
## [1] 217.6746
AIC(model12)
## [1] 194.4632
BIC(model12)
## [1] 213.9764
AIC(model13)
## [1] 183.899
BIC(model13)
## [1] 203.4122

As seen above, model13, the auto.arima result ARIMA(2,0,1), is the best model, with the lowest values. ARIMA(2,0,1) should be selected for the further steps.

Test for fitted values:

accu=function(actual,forecast){
  n=length(actual)
  error=actual-forecast
  mean=mean(actual)
  sd=sd(actual)
  CV=sd/mean
  MAPE=sum(abs(error/actual))/n
  MAD=sum(abs(error))/n
  MADP=sum(abs(error))/sum(abs(actual))
  WMAPE=MAD/mean
  l=data.frame(n,mean,sd,CV,MAPE,MAD,MADP,WMAPE)
  return(l)
}

accu(ts11,tail(head(fitted_transformed11,369),366))
##     n     mean       sd       CV      MAPE      MAD      MADP     WMAPE
## 1 372 385.1452 422.3167 1.096513 0.2991687 104.9772 0.2725654 0.2725654
accu(ts11,tail(head(fitted_transformed12,369),366))
##     n     mean       sd       CV      MAPE      MAD      MADP     WMAPE
## 1 372 385.1452 422.3167 1.096513 0.3054549 102.8093 0.2669366 0.2669366
accu(ts11,tail(head(fitted_transformed13,369),366))
##     n     mean       sd       CV     MAPE      MAD      MADP     WMAPE
## 1 372 385.1452 422.3167 1.096513 0.298858 102.0573 0.2649839 0.2649839

Again, the third result, ARIMA(2,0,1), gives the best performance on MAPE, MAD and MADP, and is nearly equal to the MA model on WMAPE, which in turn beats the AR(3) model. So by performance, the ARIMA model is better than the MA model, and the MA model is better than the AR model.

Step 3

In this step, possible regressors will be examined to be used on arimax model.

Code below shows correlation between sold count and the data categories.

correlation=function(dt){  # Correlation between sold count (column 4) and the other numeric columns (5..12)
  dt=as.data.frame(dt)
  vars=c("visit count","basket count","favored count","category sold",
         "category visit","category basket","category favored","category brand sold")
  a=data.frame(Variable=vars,Correlation=NA_real_)
  for(i in 1:8) {
    a[i,2]=cor(dt[,4],dt[,i+4])
  }
  return (a)
}
cor(mendil$sold_count,mendil$price)
## [1] -0.5721201
correlation(mendil)
##              Variable Correlation
## 1         visit count   0.3002766
## 2        basket count   0.8872024
## 3       favored count   0.2832415
## 4       category sold   0.9157251
## 5      category visit   0.4285587
## 6     category basket   0.1962214
## 7    category favored   0.7963313
## 8 category brand sold   0.1320124

Obviously there is high correlation between the sold count and the basket count, category sold and category favored columns. This result makes sense because people tend to buy what they have favored or put in their basket. Thanks to the function created above, possible regressors are selected easily. However, a potential problem with these regressors is that their future values are not available, so they would have to be predicted as well in the final version of the model.

Step 4

In this step, the potential regressors will be used in the model and ARIMAX models will be created.

modelx1=arima(decomposed11$random,order=c(2,0,1),xreg=mendil$basket_count)
print(modelx1)
## 
## Call:
## arima(x = decomposed11$random, order = c(2, 0, 1), xreg = mendil$basket_count)
## 
## Coefficients:
##          ar1     ar2     ma1  intercept  mendil$basket_count
##       0.5269  0.0123  0.1696     0.6284                3e-04
## s.e.  0.3535  0.2216  0.3458     0.0303                  NaN
## 
## sigma^2 estimated as 0.06762:  log likelihood = -26.61,  aic = 65.22
fittedx1=fitted(modelx1)
fitted_transformedx1=fittedx1*decomposed11$seasonal*decomposed11$trend

modelx2=arima(decomposed11$random,order=c(2,0,1),xreg=mendil$category_sold)
print(modelx2)
## 
## Call:
## arima(x = decomposed11$random, order = c(2, 0, 1), xreg = mendil$category_sold)
## 
## Coefficients:
##          ar1      ar2     ma1  intercept  mendil$category_sold
##       0.4713  -0.2106  0.0706     0.7133                 1e-04
## s.e.  0.1489   0.0604  0.1319     0.0302                   NaN
## 
## sigma^2 estimated as 0.07551:  log likelihood = -46.71,  aic = 105.43
fittedx2=fitted(modelx2)
fitted_transformedx2=fittedx2*decomposed11$seasonal*decomposed11$trend

modelx3=arima(decomposed11$random,order=c(2,0,1),xreg=mendil$category_favored)
print(modelx3)
## 
## Call:
## arima(x = decomposed11$random, order = c(2, 0, 1), xreg = mendil$category_favored)
## 
## Coefficients:
##          ar1      ar2      ma1  intercept  mendil$category_favored
##       0.6261  -0.3234  -0.1738     0.7596                        0
## s.e.  0.1341   0.0466   0.1242     0.0306                      NaN
## 
## sigma^2 estimated as 0.08683:  log likelihood = -72.27,  aic = 156.53
fittedx3=fitted(modelx3)
fitted_transformedx3=fittedx3*decomposed11$seasonal*decomposed11$trend
AIC(model13)
## [1] 183.899
AIC(modelx1)
## [1] 65.22218
AIC(modelx2)
## [1] 105.4285
AIC(modelx3)
## [1] 156.5311

For an initial comparison, the ARIMAX model with basket count as regressor has the best (lowest) AIC value.

Three models are constructed, one for each of the three regressors. Now they will be tested on the last available one-week window (the final few fitted values are missing because the centred moving-average trend is undefined at the series ends, hence the head/tail trimming).

accu(ts11,tail(head(fitted_transformed13,369),7)) # arima model
##     n     mean       sd       CV        MAPE      MAD        MADP       WMAPE
## 1 372 385.1452 422.3167 1.096513 0.001917651 1.233079 0.003201597 0.003201597
accu(ts11,tail(head(fitted_transformedx1,369),7)) # arimax1
##     n     mean       sd       CV        MAPE      MAD        MADP       WMAPE
## 1 372 385.1452 422.3167 1.096513 0.002547509 1.521034 0.003949249 0.003949249
accu(ts11,tail(head(fitted_transformedx2,369),7)) # arimax2
##     n     mean       sd       CV        MAPE      MAD        MADP       WMAPE
## 1 372 385.1452 422.3167 1.096513 0.001909252 1.182135 0.003069323 0.003069323
accu(ts11,tail(head(fitted_transformedx3,369),7)) # arimax3
##     n     mean       sd       CV        MAPE      MAD        MADP       WMAPE
## 1 372 385.1452 422.3167 1.096513 0.002345371 1.412103 0.003666417 0.003666417

As seen in the results, all models give similar errors on the last week; arimax2 (category sold) is marginally best on WMAPE and the other metrics, while arimax1 (basket count) had the clearly lowest AIC in the in-sample comparison above.

Final Step

After all the tests and operations done above, the arimax1 model, which has the lowest AIC, is selected as the final model.

Product 3 (Xiaomi Bluetooth Kulaklık)

Step 1

In this step, the seasonality of the product is analysed to make an appropriate decomposition.

Seasonality Analysis

The plot below shows the sold count of product-3 for the related period.

As seen above, there is a pattern which fluctuates up on certain days and down on others. This directs us to check the autocorrelation so that the related seasonality can be determined.

The table below shows the autocorrelation values at each lag up to lag 100.

It would not be inappropriate to say that the autocorrelation function shows a pattern as well. Around lags 7 and 30 the autocorrelation values get higher, and the autocorrelation fades away at later lags. Normally, the time series object of sold count for product-3 ought to be given a frequency of 7 or 30, but we had better determine the seasonality after the decomposition stage. There is weekly and possibly monthly seasonality for this product.

Decomposition

As seen below, the variance increases over time, so the best way of decomposing is to use the multiplicative method.

Frequency=7

The required code is below:

ts11=ts(xiaomi$sold_count, frequency = 7)
decomposed11= decompose(ts11,type="multiplicative")
deseasonalised11=ts11/decomposed11$seasonal
detrend11=deseasonalised11/decomposed11$trend
acf(detrend11,na.action = na.pass, main="Decomposed ACF freq=7")

The ACF of the decomposed series seems really low and acceptable. Moreover, as seen in the graph, the random (detrended) values are quite stationary, with stable variance (except for some jumps) and a nearly constant mean. The trend gradually increases and decreases with fluctuations, which may cause some forecast error, but the model is still acceptable and good. Lastly, the seasonal component shows a decent pattern with an acceptable shape.

Frequency=30

The required code is below:

ts13=ts(xiaomi$sold_count, frequency = 30)
decomposed13= decompose(ts13,type="multiplicative")
deseasonalised13=ts13/decomposed13$seasonal
detrend13=deseasonalised13/decomposed13$trend
acf(detrend13,na.action = na.pass, main="Decomposed ACF freq=30")

The detrended values are worse in this model: they behave nonstationarily, with changing variance and mean. The trend component gives no clear indication of increase or decrease. The ACF of the detrended values is not that bad, but the random component is worse than in the weekly model. The seasonal component also shows larger deviations, which may make the results fluctuate more.

For the remaining parts, we will continue with frequency = 7.

Step 2

Based on the findings in the first part, ARIMA models will be proposed and their performance will be discussed.

As understood from the ACF and PACF plots, both tend to decay. For the decaying ACF, the PACF shows no significant spikes after lag 3, so an AR(3) model can be proposed. For the decaying PACF, the ACF shows no significant spikes after lag 3, so an MA(3) model is proposed as well. We can also try auto.arima and compare these three to select the best model. Additionally, there is no sign of a seasonal ARIMA here.
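
The identification logic above can be illustrated on a synthetic AR(3) process, where the expected signature is known: the ACF decays gradually while the PACF cuts off after lag 3. (The real plots come from `decomposed11$random`; this is only a sketch of the reasoning.)

```r
# Simulate a stationary AR(3) process with known coefficients
set.seed(2)
x <- arima.sim(n = 1000, model = list(ar = c(0.4, -0.2, 0.3)))

# Sample partial autocorrelations
p_acf <- pacf(x, plot = FALSE)$acf

# Partial autocorrelations beyond lag 3 should fall inside the
# approximate +/- 2/sqrt(n) significance band
round(p_acf[1:6], 2)
```

For an MA(q) process the roles swap: the ACF cuts off after lag q and the PACF decays, which is the second pattern used above to propose MA(3).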

AR(3)

AR(3) is created below:

model11=arima(decomposed11$random,order=c(3,0,0))
print(model11)
## 
## Call:
## arima(x = decomposed11$random, order = c(3, 0, 0))
## 
## Coefficients:
##          ar1      ar2      ar3  intercept
##       0.1838  -0.1684  -0.3338     0.9851
## s.e.  0.0492   0.0493   0.0491     0.0094
## 
## sigma^2 estimated as 0.05641:  log likelihood = 6.56,  aic = -3.11
fitted11=fitted(model11)
fitted_transformed11=fitted11*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed11,type = "l",col="blue")

Fitted values (blue) are very close to the real sales amounts (black). They lag slightly because of the AR terms.

Residuals show a constant mean, but the variance increases.

MA(3)

MA(3) is created below:

model12=arima(decomposed11$random,order=c(0,0,3))
print(model12)
## 
## Call:
## arima(x = decomposed11$random, order = c(0, 0, 3))
## 
## Coefficients:
##          ma1      ma2      ma3  intercept
##       0.0280  -0.2659  -0.4802     0.9848
## s.e.  0.0423   0.0519   0.0475     0.0034
## 
## sigma^2 estimated as 0.05211:  log likelihood = 20.81,  aic = -31.62
fitted12=fitted(model12)
fitted_transformed12=fitted12*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed12,type = "l",col="blue")

Fitted values (blue) are very close to the real sales amounts (black) in this model too.

Residuals are very similar to those of the first model, and the AIC value is better here.

Auto.Arima

Auto Arima is created below:

model13=auto.arima(decomposed11$random,seasonal = FALSE)
print(model13)
## Series: decomposed11$random 
## ARIMA(5,0,3) with non-zero mean 
## 
## Coefficients:
##          ar1      ar2     ar3      ar4      ar5      ma1     ma2      ma3
##       0.2611  -0.6062  0.1784  -0.1918  -0.2221  -0.1929  0.3040  -0.5984
## s.e.  0.1403   0.1237  0.1092   0.0840   0.0689   0.1381  0.1086   0.0989
##         mean
##       0.9851
## s.e.  0.0038
## 
## sigma^2 estimated as 0.0501:  log likelihood=32.35
## AIC=-44.69   AICc=-44.07   BIC=-5.67
model13=arima(decomposed11$random,order=c(5,0,3)) # Manual re-fit of the auto.arima model (same specification)
fitted13=fitted(model13)
fitted_transformed13=fitted13*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed13,type = "l",col="blue")

Fitted values (blue) are again very close to the real sales amounts (black).

plot(residuals(model13), xlab="Days", ylab="Values", main="Residuals")

tsdisplay(fitted13,main="ARIMA(5,0,3)")

Residuals show a constant mean, but the variance varies at times. The ACF and PACF plots tell us this is the best model in terms of the lowest residual ACF and PACF values.
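
A numeric complement to the visual residual inspection above is the Ljung-Box test from base R, which tests whether residual autocorrelations up to a chosen lag are jointly zero. It is shown here on simulated white noise standing in for `residuals(model13)`; a large p-value is consistent with uncorrelated residuals.

```r
# Stand-in for residuals(model13): white noise should pass the test
set.seed(5)
res <- rnorm(300)

# Ljung-Box test of no autocorrelation up to lag 10
lb <- Box.test(res, lag = 10, type = "Ljung-Box")
lb$p.value  # a small p-value would signal remaining autocorrelation
```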

Model Comparison

AIC(model11)
## [1] -3.110113
BIC(model11)
## [1] 16.40305
AIC(model12)
## [1] -31.62326
BIC(model12)
## [1] -12.1101
AIC(model13)
## [1] -44.69223
BIC(model13)
## [1] -5.665892

As seen above, model13, the auto.arima fit ARIMA(5,0,3), is the best in AIC. However, the MA model beats the ARIMA in BIC because of the heavier parameter penalty in the BIC calculation.

Test for fitted values:

accu=function(actual,forecast){
  n=length(actual)
  error=actual-forecast
  mean=mean(actual)
  sd=sd(actual)
  CV=sd/mean
  MAPE=sum(abs(error/actual))/n
  MAD=sum(abs(error))/n
  MADP=sum(abs(error))/sum(abs(actual))
  WMAPE=MAD/mean
  l=data.frame(n,mean,sd,CV,MAPE,MAD,MADP,WMAPE)
  return(l)
}
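
A quick sanity check on these metrics (self-contained, with the relevant lines of accu() copied into a toy helper): whenever all actuals are strictly positive, WMAPE and MADP coincide, since MAD/mean(actual) and sum(|error|)/sum(actual) are algebraically the same quantity. This is why the two columns match in every accu() output in this report.

```r
# Toy version of accu() keeping only the error metrics
accu_demo <- function(actual, forecast) {
  n     <- length(actual)
  error <- actual - forecast
  MAD   <- sum(abs(error)) / n
  MADP  <- sum(abs(error)) / sum(abs(actual))
  WMAPE <- MAD / mean(actual)
  data.frame(n, MAD, MADP, WMAPE)
}

# sum(|error|) = 2 + 2 + 3 + 3 = 10, sum(actual) = 100
res <- accu_demo(c(10, 20, 30, 40), c(12, 18, 33, 37))
res  # MADP and WMAPE both equal 0.1
```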

accu(ts11,tail(head(fitted_transformed11,369),366))
##     n     mean      sd        CV      MAPE      MAD      MADP     WMAPE
## 1 372 392.0323 218.106 0.5563472 0.1860551 66.12454 0.1686712 0.1686712
accu(ts11,tail(head(fitted_transformed12,369),366))
##     n     mean      sd        CV      MAPE      MAD      MADP     WMAPE
## 1 372 392.0323 218.106 0.5563472 0.1814687 63.31901 0.1615148 0.1615148
accu(ts11,tail(head(fitted_transformed13,369),366))
##     n     mean      sd        CV      MAPE      MAD      MADP     WMAPE
## 1 372 392.0323 218.106 0.5563472 0.1754918 61.90513 0.1579083 0.1579083

ARIMA(5,0,3) gives the best result on all test metrics. In terms of performance, the ARIMA model beats the MA model, which in turn beats the AR model, so ARIMA(5,0,3) is selected for the further steps.
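
The selected model forecasts the random component only, so an out-of-sample forecast must be re-seasonalised the same way the fitted values were. The sketch below shows one way to do this on synthetic weekly data standing in for `ts11`/`decomposed11`; carrying the last observed trend value forward is a simplifying assumption of this sketch, not the report's method, and the AR(3) order is illustrative.

```r
# Synthetic weekly series (exactly 20 weeks, so seasonal phases align)
set.seed(3)
y   <- ts(300 + 50 * sin(2 * pi * (1:140) / 7) + rnorm(140, sd = 15),
          frequency = 7)
dec <- decompose(y, type = "multiplicative")

# Fit on the random component (stats::arima tolerates the NA end-pads)
fit     <- arima(dec$random, order = c(3, 0, 0))   # illustrative order
rand_fc <- predict(fit, n.ahead = 7)$pred

# Multiply the seasonal and trend factors back in
seas_fc  <- dec$seasonal[1:7]                      # weekly factors repeat
trend_fc <- as.numeric(tail(na.omit(dec$trend), 1))  # naive trend carry-forward
sales_fc <- rand_fc * seas_fc * trend_fc
round(sales_fc)                                    # 7 daily forecasts
```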

Step 3

In this step, possible regressors will be examined for use in an ARIMAX model.

The code below shows the correlation between the sold count and the other data columns.

correlation=function(dt){  # Function to compute correlation between sold counts and other columns in the data
  dt=as.data.frame(dt)
  a=data.frame(data= NA, nrow=8)
  colnames(a)=c("Variable","Correlation")
  a[1,1]="visit count"
  a[2,1]="basket count"
  a[3,1]="favored count"
  a[4,1]="category sold"
  a[5,1]="category visit"
  a[6,1]="category basket"
  a[7,1]="category favored"
  a[8,1]="category brand sold"
  for(i in 1:8) {
    a[i,2]=cor(dt[,4],dt[,i+4])
  }
  return (a)
}
cor(xiaomi$sold_count,xiaomi$price)
## [1] -0.5127488
correlation(xiaomi)
##              Variable Correlation
## 1         visit count  0.21153610
## 2        basket count  0.86567764
## 3       favored count  0.22808036
## 4       category sold  0.53232294
## 5      category visit  0.01179411
## 6     category basket  0.06057578
## 7    category favored  0.27347833
## 8 category brand sold  0.06973421

Clearly there is high correlation between the sold count and the basket count, category sold and price. This makes sense: people tend to buy what they have basketed or favored, and higher prices make people buy less. Thanks to the function created above, candidate regressors are selected easily. However, a potential problem with these regressors is that their future values are unknown, so they would have to be forecast as well in the final version of the model.
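
The caveat above (future regressor values are unknown) can be handled by forecasting the regressor first and feeding those forecasts into the ARIMAX prediction. The sketch below uses synthetic stand-ins for `sold_count` and `basket_count`; the AR(1) order and the two-step scheme are illustrative, not the report's final procedure.

```r
library(forecast)

# Synthetic stand-ins: a drifting regressor and a target driven by it
set.seed(4)
basket <- ts(100 + cumsum(rnorm(150)))            # stand-in for basket_count
sales  <- ts(0.8 * basket + rnorm(150, sd = 5))   # stand-in for sold_count

# ARIMAX-style fit: ARIMA errors plus an external regressor
fit <- arima(sales, order = c(1, 0, 0), xreg = basket)

# Step 1: forecast the regressor itself
basket_fc <- forecast(auto.arima(basket), h = 7)$mean

# Step 2: plug the forecast regressor values into newxreg
sales_fc <- predict(fit, n.ahead = 7, newxreg = basket_fc)$pred
length(sales_fc)
```

Note that forecast errors in the regressor propagate into the sales forecast, which is exactly the risk the paragraph above points out.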

Step 4

In this step, the potential regressors will be added to the model and ARIMAX models will be created.

modelx1=arima(decomposed11$random,order=c(5,0,3),xreg=xiaomi$basket_count)
print(modelx1)
## 
## Call:
## arima(x = decomposed11$random, order = c(5, 0, 3), xreg = xiaomi$basket_count)
## 
## Coefficients:
##          ar1      ar2      ar3      ar4      ar5      ma1     ma2      ma3
##       0.2035  -0.7361  -0.0094  -0.1607  -0.3037  -0.0914  0.4821  -0.3312
## s.e.  0.1565   0.1304   0.1136   0.0761   0.0642   0.1576  0.1160   0.0926
##       intercept  xiaomi$basket_count
##          0.8694                1e-04
## s.e.     0.0091                  NaN
## 
## sigma^2 estimated as 0.04347:  log likelihood = 53.98,  aic = -85.97
fittedx1=fitted(modelx1)
fitted_transformedx1=fittedx1*decomposed11$seasonal*decomposed11$trend

modelx2=arima(decomposed11$random,order=c(5,0,3),xreg=xiaomi$category_sold)
print(modelx2)
## 
## Call:
## arima(x = decomposed11$random, order = c(5, 0, 3), xreg = xiaomi$category_sold)
## 
## Coefficients:
##           ar1     ar2     ar3      ar4     ar5     ma1     ma2      ma3
##       -0.0373  0.2239  0.5257  -0.1746  0.2020  0.6214  0.0901  -0.5859
## s.e.   0.1237  0.0956  0.0976   0.0958  0.0579  0.1326  0.1572   0.1344
##       intercept  xiaomi$category_sold
##          0.4151                 7e-04
## s.e.     0.0697                 1e-04
## 
## sigma^2 estimated as 0.03922:  log likelihood = 72.69,  aic = -123.39
fittedx2=fitted(modelx2)
fitted_transformedx2=fittedx2*decomposed11$seasonal*decomposed11$trend

modelx3=arima(decomposed11$random,order=c(5,0,3),xreg=xiaomi$price)
print(modelx3)
## 
## Call:
## arima(x = decomposed11$random, order = c(5, 0, 3), xreg = xiaomi$price)
## 
## Coefficients:
##          ar1      ar2     ar3      ar4      ar5      ma1     ma2      ma3
##       0.2516  -0.6212  0.1537  -0.1925  -0.2367  -0.1888  0.3213  -0.5803
## s.e.  0.1470   0.1232  0.1084   0.0851   0.0686   0.1467  0.1095   0.1029
##       intercept  xiaomi$price
##          1.1520       -0.0012
## s.e.     0.0557        0.0004
## 
## sigma^2 estimated as 0.04731:  log likelihood = 38.31,  aic = -54.61
fittedx3=fitted(modelx3)
fitted_transformedx3=fittedx3*decomposed11$seasonal*decomposed11$trend
AIC(model13)
## [1] -44.69223
AIC(modelx1)
## [1] -85.96773
AIC(modelx2)
## [1] -123.3877
AIC(modelx3)
## [1] -54.6138

For an initial comparison, although all AIC values are negative and quite low, the ARIMAX model with category sold as the regressor has the best AIC value. However, the test results may tell a different story.

Three ARIMAX models were constructed, one for each candidate regressor. Now they will be tested on the last week of data.

accu(ts11,tail(head(fitted_transformed13,369),7)) # arima model
##     n     mean      sd        CV        MAPE      MAD        MADP       WMAPE
## 1 372 392.0323 218.106 0.5563472 0.003007114 1.334838 0.003404918 0.003404918
accu(ts11,tail(head(fitted_transformedx1,369),7)) # arimax1
##     n     mean      sd        CV        MAPE      MAD        MADP       WMAPE
## 1 372 392.0323 218.106 0.5563472 0.002735178 1.240497 0.003164274 0.003164274
accu(ts11,tail(head(fitted_transformedx2,369),7)) # arimax2
##     n     mean      sd        CV        MAPE      MAD        MADP       WMAPE
## 1 372 392.0323 218.106 0.5563472 0.003638369 1.522088 0.003882558 0.003882558
accu(ts11,tail(head(fitted_transformedx3,369),7)) # arimax3
##     n     mean      sd        CV        MAPE     MAD        MADP       WMAPE
## 1 372 392.0323 218.106 0.5563472 0.002616265 1.19986 0.003060615 0.003060615

Interpreting the test, arimax3 (the ARIMAX with price as the regressor) gives the lowest error metrics. It is best on all metrics, including WMAPE, even though its AIC value is relatively higher.

Final Step

After all the tests and operations above, the arimax3 model (price as the regressor) gives the best results and is selected as the final model.

Product 4 (Fakir Süpürge)

Step 1

In this step, the seasonality of the product is analysed so that an appropriate decomposition can be made.

Seasonality Analysis

The plot below shows the sold count of product-4 for the related period.

As seen above, there is a pattern that spikes on certain days and dips on others. This directs us to check the autocorrelation so that the underlying seasonality can be determined.

The table below shows the autocorrelation values at each lag up to lag 100.

The autocorrelation function shows a pattern as well: around lags 7 and 14 the autocorrelation rises, and it fades away at later lags. Product-4 should be modelled with a frequency of 7, since a weekly period also covers lag 14. So there is weekly seasonality.

Decomposition

As seen below, the variance increases over time, so the best way to decompose the series is the multiplicative method.

Frequency=7

The required code is below:

ts11=ts(fakir$sold_count, frequency = 7)
decomposed11= decompose(ts11,type="multiplicative")
deseasonalised11=ts11/decomposed11$seasonal
detrend11=deseasonalised11/decomposed11$trend
acf(detrend11,na.action = na.pass, main="Decomposed ACF freq=7")

The ACF of the decomposed series is quite low and acceptable. Moreover, as seen in the graph, the random (detrended) values look fairly stationary, with stable variance (apart from some jumps) and a nearly constant mean. The trend does not tell us anything important, but the model is still acceptable. Lastly, the seasonal component shows a clear, regular pattern.

Frequency=14

The required code is below:

ts12=ts(fakir$sold_count, frequency = 14)
decomposed12= decompose(ts12,type="multiplicative")
deseasonalised12=ts12/decomposed12$seasonal
detrend12=deseasonalised12/decomposed12$trend
acf(detrend12,na.action = na.pass, main="Decomposed ACF freq=14")

plot(decomposed12)

The autocorrelation is higher than in the previous case. The reason may be that a period of 14 does not capture the weekly pattern directly, whereas a period of 7 also covers lag 14. So frequency = 14 is worse than frequency = 7.

The detrended (random) values are less stationary in this case: their variance is not constant and their mean is less stable than before. The trend is wavier but tends to decrease more. The seasonal component looks similar. Overall, this model is worse than the previous one because the detrended random values show more nonstationarity.

Frequency=30

The required code is below:

ts13=ts(fakir$sold_count, frequency = 30)
decomposed13= decompose(ts13,type="multiplicative")
deseasonalised13=ts13/decomposed13$seasonal
detrend13=deseasonalised13/decomposed13$trend
acf(detrend13,na.action = na.pass, main="Decomposed ACF freq=30")

The detrended values are worse in this model: they behave nonstationarily, with changing variance and mean. The trend component gives no clear direction, first decreasing and then increasing in the middle. The ACF of the detrended values is not very good, and the random component is worse than in the weekly model because of the changing variance and barely constant mean. The seasonal component has more terms simply because of the larger frequency.

For the remaining parts, we will continue with frequency = 7.

Step 2

Based on the findings in the first part, ARIMA models will be proposed and their performance will be discussed.

As understood from the ACF and PACF plots, both tend to decay. For the decaying ACF, the PACF shows no significant spikes after lag 3, so an AR(3) model can be proposed. For the decaying PACF, the ACF shows no significant spikes after lag 3, so an MA(3) model is proposed as well. We can also try auto.arima and compare these three to select the best model. Additionally, there is no sign of a seasonal ARIMA here.

AR(3)

AR(3) is created below:

model11=arima(decomposed11$random,order=c(3,0,0))
print(model11)
## 
## Call:
## arima(x = decomposed11$random, order = c(3, 0, 0))
## 
## Coefficients:
##          ar1      ar2      ar3  intercept
##       0.1085  -0.0945  -0.3236     0.9801
## s.e.  0.0494   0.0495   0.0500     0.0108
## 
## sigma^2 estimated as 0.07311:  log likelihood = -40.83,  aic = 91.67
fitted11=fitted(model11)
fitted_transformed11=fitted11*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed11,type = "l",col="blue")

Fitted values (blue) are very close to the real sales amounts (black). They lag slightly because of the AR terms.

Residuals show a constant mean, but the variance increases.

MA(3)

MA(3) is created below:

model12=arima(decomposed11$random,order=c(0,0,3))
print(model12)
## 
## Call:
## arima(x = decomposed11$random, order = c(0, 0, 3))
## 
## Coefficients:
##           ma1      ma2      ma3  intercept
##       -0.0202  -0.2480  -0.3644     0.9800
## s.e.   0.0478   0.0487   0.0430     0.0051
## 
## sigma^2 estimated as 0.07012:  log likelihood = -33.33,  aic = 76.65
fitted12=fitted(model12)
fitted_transformed12=fitted12*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed12,type = "l",col="blue")

Fitted values (blue) are very close to the real sales amounts (black) in this model too.

Residuals are closer to 0 in this model than in the first one, and the AIC value is better.

Auto.Arima

Auto Arima is created below:

model13=auto.arima(decomposed11$random,seasonal = FALSE)
print(model13)
## Series: decomposed11$random 
## ARIMA(4,0,1) with non-zero mean 
## 
## Coefficients:
##          ar1      ar2      ar3     ar4      ma1    mean
##       0.7425  -0.1878  -0.2388  0.0305  -0.7569  0.9799
## s.e.  0.0810   0.0639   0.0643  0.0638   0.0613  0.0051
## 
## sigma^2 estimated as 0.06838:  log likelihood=-25.8
## AIC=65.61   AICc=65.92   BIC=92.93
model13=arima(decomposed11$random,order=c(4,0,1)) # Manual re-fit of the auto.arima model (same specification)
fitted13=fitted(model13)
fitted_transformed13=fitted13*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed13,type = "l",col="blue")

Fitted values (blue) are again very close to the real sales amounts (black).

plot(residuals(model13), xlab="Days", ylab="Values", main="Residuals")

tsdisplay(fitted13,main="ARIMA(4,0,1)")

Residuals show a constant mean, but the variance varies at times. The ACF and PACF plots tell us this is the best model in terms of the lowest residual ACF and PACF values.

Model Comparison

AIC(model11)
## [1] 91.6659
BIC(model11)
## [1] 111.1791
AIC(model12)
## [1] 76.65494
BIC(model12)
## [1] 96.1681
AIC(model13)
## [1] 65.60766
BIC(model13)
## [1] 92.9261

As seen above, model13, the auto.arima fit ARIMA(4,0,1), is the best in AIC value. The second-best model is MA(3).

Test for fitted values:

accu=function(actual,forecast){
  n=length(actual)
  error=actual-forecast
  mean=mean(actual)
  sd=sd(actual)
  CV=sd/mean
  MAPE=sum(abs(error/actual))/n
  MAD=sum(abs(error))/n
  MADP=sum(abs(error))/sum(abs(actual))
  WMAPE=MAD/mean
  l=data.frame(n,mean,sd,CV,MAPE,MAD,MADP,WMAPE)
  return(l)
}

accu(ts11,tail(head(fitted_transformed11,369),366))
##     n     mean       sd        CV      MAPE    MAD      MADP     WMAPE
## 1 372 39.72581 39.17959 0.9862504 0.2396726 9.1253 0.2297071 0.2297071
accu(ts11,tail(head(fitted_transformed12,369),366))
##     n     mean       sd        CV      MAPE      MAD      MADP     WMAPE
## 1 372 39.72581 39.17959 0.9862504 0.2376557 8.985335 0.2261838 0.2261838
accu(ts11,tail(head(fitted_transformed13,369),366))
##     n     mean       sd        CV      MAPE      MAD      MADP     WMAPE
## 1 372 39.72581 39.17959 0.9862504 0.2304129 8.615689 0.2168789 0.2168789

ARIMA(4,0,1) gives the best result on all test metrics. In terms of performance, the ARIMA model beats the MA model, which in turn beats the AR model, so ARIMA(4,0,1) is selected for the further steps.

Step 3

In this step, possible regressors will be examined for use in an ARIMAX model.

The code below shows the correlation between the sold count and the other data columns.

correlation=function(dt){  # Function to compute correlation between sold counts and other columns in the data
  dt=as.data.frame(dt)
  a=data.frame(data= NA, nrow=8)
  colnames(a)=c("Variable","Correlation")
  a[1,1]="visit count"
  a[2,1]="basket count"
  a[3,1]="favored count"
  a[4,1]="category sold"
  a[5,1]="category visit"
  a[6,1]="category basket"
  a[7,1]="category favored"
  a[8,1]="category brand sold"
  for(i in 1:8) {
    a[i,2]=cor(dt[,4],dt[,i+4])
  }
  return (a)
}
cor(fakir$sold_count,fakir$price)
## [1] -0.32637
correlation(fakir)
##              Variable  Correlation
## 1         visit count -0.102187158
## 2        basket count  0.866866490
## 3       favored count -0.145043073
## 4       category sold  0.759162610
## 5      category visit  0.004137553
## 6     category basket -0.122458472
## 7    category favored  0.567747022
## 8 category brand sold -0.170524082

Clearly there is high correlation between the sold count and the basket count, category sold and category favored. This makes sense: people tend to buy what they have favored or basketed. Thanks to the function created above, candidate regressors are selected easily. However, a potential problem with these regressors is that their future values are unknown, so they would have to be forecast as well in the final version of the model.

Step 4

In this step, the potential regressors will be added to the model and ARIMAX models will be created.

modelx1=arima(decomposed11$random,order=c(4,0,1),xreg=fakir$basket_count)
print(modelx1)
## 
## Call:
## arima(x = decomposed11$random, order = c(4, 0, 1), xreg = fakir$basket_count)
## 
## Coefficients:
##          ar1      ar2      ar3     ar4      ma1  intercept  fakir$basket_count
##       0.6858  -0.1792  -0.2419  0.0176  -0.6888     0.9641               1e-04
## s.e.  0.1246   0.0635   0.0628  0.0689   0.1272     0.0196               2e-04
## 
## sigma^2 estimated as 0.06706:  log likelihood = -25.18,  aic = 66.37
fittedx1=fitted(modelx1)
fitted_transformedx1=fittedx1*decomposed11$seasonal*decomposed11$trend

modelx2=arima(decomposed11$random,order=c(4,0,1),xreg=fakir$category_sold)
print(modelx2)
## 
## Call:
## arima(x = decomposed11$random, order = c(4, 0, 1), xreg = fakir$category_sold)
## 
## Coefficients:
##          ar1      ar2      ar3     ar4      ma1  intercept  fakir$category_sold
##       0.5079  -0.1532  -0.2452  0.0022  -0.4742     0.9095                4e-04
## s.e.  0.2094   0.0633   0.0596  0.0809   0.2270     0.0285                2e-04
## 
## sigma^2 estimated as 0.06431:  log likelihood = -17.39,  aic = 50.78
fittedx2=fitted(modelx2)
fitted_transformedx2=fittedx2*decomposed11$seasonal*decomposed11$trend

modelx3=arima(decomposed11$random,order=c(4,0,1),xreg=fakir$category_favored)
print(modelx3)
## 
## Call:
## arima(x = decomposed11$random, order = c(4, 0, 1), xreg = fakir$category_favored)
## 
## Coefficients:
##          ar1      ar2      ar3     ar4      ma1  intercept
##       0.7029  -0.1821  -0.2399  0.0218  -0.7198     0.9571
## s.e.  0.0803   0.0625   0.0632  0.0655   0.0555     0.0069
##       fakir$category_favored
##                            0
## s.e.                     NaN
## 
## sigma^2 estimated as 0.06646:  log likelihood = -23.58,  aic = 63.17
fittedx3=fitted(modelx3)
fitted_transformedx3=fittedx3*decomposed11$seasonal*decomposed11$trend
AIC(model13)
## [1] 65.60766
AIC(modelx1)
## [1] 66.36656
AIC(modelx2)
## [1] 50.77947
AIC(modelx3)
## [1] 63.16558

For an initial comparison, the arimax2 model (category sold as the regressor) has the lowest AIC value, which may be a sign for its selection.

Three ARIMAX models were constructed, one for each candidate regressor. Now they will be tested on the last week of data.

accu(ts11,tail(head(fitted_transformed13,369),7)) # arima model
##     n     mean       sd        CV        MAPE        MAD        MADP
## 1 372 39.72581 39.17959 0.9862504 0.005127835 0.05148335 0.001295967
##         WMAPE
## 1 0.001295967
accu(ts11,tail(head(fitted_transformedx1,369),7)) # arimax1
##     n     mean       sd        CV       MAPE        MAD        MADP       WMAPE
## 1 372 39.72581 39.17959 0.9862504 0.00489123 0.04979956 0.001253582 0.001253582
accu(ts11,tail(head(fitted_transformedx2,369),7)) # arimax2
##     n     mean       sd        CV        MAPE        MAD        MADP
## 1 372 39.72581 39.17959 0.9862504 0.004614496 0.04775374 0.001202083
##         WMAPE
## 1 0.001202083
accu(ts11,tail(head(fitted_transformedx3,369),7)) # arimax3
##     n     mean       sd        CV        MAPE        MAD        MADP
## 1 372 39.72581 39.17959 0.9862504 0.005139033 0.05145274 0.001295197
##         WMAPE
## 1 0.001295197

Interpreting the test, all four models give very close results with strong performance, so any of them could reasonably be chosen. Even though arimax2 has the lowest AIC value, the test results are close, and to reduce the risk of forecast mistakes the simplest model should be preferred: the plain ARIMA(4,0,1) model without regressors.

Final Step

After all the tests and operations above, the ARIMA(4,0,1) model is selected as the final model for its simplicity.

Product 5 (Tayt)

Step 1

In this step, the seasonality of the product is analysed so that an appropriate decomposition can be made.

Seasonality Analysis

The plot below shows the sold count of product-5 for the related period.

As seen above, there is a pattern that spikes on certain days and dips on others. This directs us to check the autocorrelation so that the underlying seasonality can be determined.

The table below shows the autocorrelation values at each lag up to lag 100.

The autocorrelation function shows a pattern as well: around lags 14 and 15 the autocorrelation rises, and it fades away at later lags. A frequency of 7 will also be tried in the decomposition part, since it has given the best results for the other products so far. In short, a similar pattern recurs roughly every 14 days.

Decomposition

As seen below, the variance increases over time, so the best way to decompose the series is the multiplicative method.

Frequency=7

The required code is below:

ts11=ts(tayt$sold_count, frequency = 7)
decomposed11= decompose(ts11,type="multiplicative")
deseasonalised11=ts11/decomposed11$seasonal
detrend11=deseasonalised11/decomposed11$trend
acf(detrend11,na.action = na.pass, main="Decomposed ACF freq=7")

The ACF of the decomposed series is quite low and acceptable. Moreover, as seen in the graph, the random (detrended) values can be counted as stationary, with stable variance and an approximately constant mean. The trend is wavy and does not tell us anything important, but the model may still be acceptable. Lastly, the seasonal component shows a clear, regular pattern.

Frequency=14

The required code is below:

ts13=ts(tayt$sold_count, frequency = 14)
decomposed13= decompose(ts13,type="multiplicative")
deseasonalised13=ts13/decomposed13$seasonal
detrend13=deseasonalised13/decomposed13$trend
acf(detrend13,na.action = na.pass, main="Decomposed ACF freq=14")

plot(decomposed13)

The autocorrelation is a little higher than in the previous case.

The detrended (random) values are less stationary here: their variance is not as constant and their mean is less stable than before. The trend in this series is not linear, which hurts forecasting. The seasonal component looks similar. Overall, this model is worse than the previous one because the detrended random values show signs of nonstationarity.

Frequency=15

The required code is below:

ts12=ts(tayt$sold_count, frequency = 15)
decomposed12= decompose(ts12,type="multiplicative")
deseasonalised12=ts12/decomposed12$seasonal
detrend12=deseasonalised12/decomposed12$trend
acf(detrend12,na.action = na.pass, main="Decomposed ACF freq=15")

plot(decomposed12)

The detrended values are worse in this model: they behave nonstationarily, with changing variance and mean. The trend component is again not linear; it fluctuates instead. The ACF of the detrended values is not bad, but the random component is worse than in the weekly model because of the changing variance and barely constant mean. The seasonal component has more terms simply because of the larger frequency.

For the remaining parts, even though none of the options gives ideal results, we will continue with frequency = 7 because its detrended values are slightly better than the alternatives (slightly more stationary, with a more constant mean and variance).

Step 2

Based on the findings in the first part, ARIMA models will be proposed and their performance will be discussed.

As understood from the ACF and PACF plots, both tend to decay. For the decaying ACF, the PACF shows no significant spikes after lag 2, so an AR(2) model can be proposed (the negative autocorrelation at lag 2 is another sign for AR(2)). For the decaying PACF, the ACF shows no significant spikes after lag 5, so an MA(5) model is proposed. We can also try auto.arima and compare these three to select the best model. Additionally, there is no sign of a seasonal ARIMA here.

AR(2)

AR(2) is created below:

model11=arima(decomposed11$random,order=c(2,0,0))
print(model11)
## 
## Call:
## arima(x = decomposed11$random, order = c(2, 0, 0))
## 
## Coefficients:
##          ar1      ar2  intercept
##       0.5551  -0.3861     0.9639
## s.e.  0.0488   0.0492     0.0182
## 
## sigma^2 estimated as 0.08323:  log likelihood = -64.61,  aic = 137.22
fitted11=fitted(model11)
fitted_transformed11=fitted11*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed11,type = "l",col="blue")

Fitted values (blue) are very close to the real sales amounts (black). They lag slightly because of the AR terms.

Residuals show a constant mean, but the variance changes a lot. The ACF and PACF still show some remaining autocorrelation.

MA(5)

MA(5) is created below:

model12=arima(decomposed11$random,order=c(0,0,5))
print(model12)
## 
## Call:
## arima(x = decomposed11$random, order = c(0, 0, 5))
## 
## Coefficients:
##          ma1      ma2      ma3      ma4      ma5  intercept
##       0.4285  -0.1421  -0.3891  -0.2371  -0.0751     0.9629
## s.e.  0.0522   0.0578   0.0520   0.0532   0.0526     0.0086
## 
## sigma^2 estimated as 0.07724:  log likelihood = -51.05,  aic = 116.09
fitted12=fitted(model12)
fitted_transformed12=fitted12*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed12,type = "l",col="blue")

Fitted values (blue) are very close to the real sales amounts (black) in this model too.

Residuals look the same as in the AR(2) model, but from the summaries above the AIC value is better (lower) here. The ACF and PACF plots show that the autocorrelation is handled better in this model.

Auto.Arima

Auto Arima is created below:

model13=auto.arima(decomposed11$random,seasonal = FALSE)
print(model13)
## Series: decomposed11$random 
## ARIMA(2,0,1) with non-zero mean 
## 
## Coefficients:
##          ar1      ar2      ma1    mean
##       1.0850  -0.6042  -0.6853  0.9633
## s.e.  0.0534   0.0426   0.0515  0.0087
## 
## sigma^2 estimated as 0.07533:  log likelihood=-44.56
## AIC=99.11   AICc=99.28   BIC=118.63
model13=arima(decomposed11$random,order=c(2,0,1)) # Manual re-fit of the auto.arima model (same specification)
fitted13=fitted(model13)
fitted_transformed13=fitted13*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed13,type = "l",col="blue")

Fitted values (blue) are again very close to the real sales amounts (black).

plot(residuals(model13), xlab="Days", ylab="Values", main="Residuals")

tsdisplay(fitted13,main="ARIMA(2,0,1)")

Residuals are still similar to the previous two models. The ACF and PACF plots tell us this is the best model in terms of the lowest residual ACF and PACF values.

Model Comparison

AIC(model11)
## [1] 137.224
BIC(model11)
## [1] 152.8346
AIC(model12)
## [1] 116.0926
BIC(model12)
## [1] 143.411
AIC(model13)
## [1] 99.11418
BIC(model13)
## [1] 118.6273

As seen above, model13, the auto.arima fit ARIMA(2,0,1), is the best in both AIC and BIC. The second-best model is MA(5).

Test for fitted values:

accu=function(actual,forecast){
  n=length(actual)
  error=actual-forecast
  mean=mean(actual)
  sd=sd(actual)
  CV=sd/mean
  MAPE=sum(abs(error/actual))/n
  MAD=sum(abs(error))/n
  MADP=sum(abs(error))/sum(abs(actual))
  WMAPE=MAD/mean
  l=data.frame(n,mean,sd,CV,MAPE,MAD,MADP,WMAPE)
  return(l)
}

accu(ts11,tail(head(fitted_transformed11,369),366))
##     n    mean       sd       CV      MAPE      MAD      MADP     WMAPE
## 1 372 858.207 1099.162 1.280766 0.2689123 233.0372 0.2715396 0.2715396
accu(ts11,tail(head(fitted_transformed12,369),366))
##     n    mean       sd       CV      MAPE      MAD      MADP     WMAPE
## 1 372 858.207 1099.162 1.280766 0.2695938 222.8493 0.2596685 0.2596685
accu(ts11,tail(head(fitted_transformed13,369),366))
##     n    mean       sd       CV      MAPE      MAD      MADP     WMAPE
## 1 372 858.207 1099.162 1.280766 0.2641632 217.3164 0.2532214 0.2532214

The test errors are small and close to each other, but if one has to be chosen, ARIMA(2,0,1) gives slightly better results on every measure. By performance, the ARIMA model beats the MA model, which in turn beats the AR model. ARIMA(2,0,1) should be selected for the further steps.
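
The decompose-fit-retransform pipeline used above can be summarized end to end on synthetic data; the series, the ARIMA order, and all names below are illustrative stand-ins, not taken from the Trendyol data.

```r
set.seed(42)
# Synthetic daily sales with weekly seasonality and a rising trend
n <- 140
sales <- (200 + seq_len(n)) * rep(c(1.3, 1.1, 1.0, 0.9, 0.8, 0.9, 1.0), n / 7) *
  exp(rnorm(n, sd = 0.05))

ts1 <- ts(sales, frequency = 7)
dec <- decompose(ts1, type = "multiplicative")

# Fit ARIMA to the random component (stats::arima tolerates the NA edges)
fit <- arima(dec$random, order = c(2, 0, 1))

# base stats has no fitted() method for arima, so recover fitted values as
# observed minus residuals, then undo the decomposition
fitted_random <- dec$random - residuals(fit)
fitted_sales  <- fitted_random * dec$seasonal * dec$trend

# Fitted values should track the original scale where the trend is defined
head(na.omit(cbind(actual = ts1, fitted = fitted_sales)))
```

The same idea underlies every `fitted_transformed` object in this report: the model only ever sees the random component, and the seasonal and trend factors are multiplied back in afterwards.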

Step 3

In this step, possible regressors will be examined for use in an ARIMAX model.

Code below shows correlation between sold count and the data categories.

correlation=function(dt){  # Correlation between sold count (column 4) and the other count columns (5-12)
  dt=as.data.frame(dt)
  a=data.frame(Variable=c("visit count","basket count","favored count",
                          "category sold","category visit","category basket",
                          "category favored","category brand sold"),
               Correlation=NA_real_)
  for(i in 1:8) {
    a[i,2]=cor(dt[,4],dt[,i+4])
  }
  return(a)
}
cor(tayt$sold_count,tayt$price)
## [1] -0.2583435
correlation(tayt)
##              Variable Correlation
## 1         visit count  0.05290733
## 2        basket count  0.83720969
## 3       favored count  0.16874292
## 4       category sold  0.90004658
## 5      category visit  0.05368907
## 6     category basket -0.08685947
## 7    category favored  0.57715502
## 8 category brand sold -0.03864685

There is clearly high correlation between sold count and basket count, category sold, and category favored. This makes sense: people tend to buy what they have favored or added to their basket. Thanks to the function above, possible regressors are selected easily. However, a potential problem with these regressors is that their future values are unknown, so they would have to be forecast as well in the final version of the model.
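
One workable way to handle the missing future regressor values is a two-stage forecast: model the regressor on its own, project it ahead, and feed those projections into the ARIMAX forecast via `newxreg`. The sketch below uses synthetic series and illustrative orders, not the real basket_count data.

```r
set.seed(7)
n <- 120
basket <- 50 + cumsum(rnorm(n))        # synthetic stand-in for a regressor
sold   <- 5 + 0.8 * basket + rnorm(n)  # synthetic target driven by it

# Stage 1: model the regressor alone and project it one week ahead
reg_fit <- arima(basket, order = c(0, 1, 0))  # simple random-walk model
reg_fc  <- predict(reg_fit, n.ahead = 7)$pred

# Stage 2: fit the target with the regressor, then forecast using the
# projected regressor values supplied as newxreg
fit <- arima(sold, order = c(1, 0, 0), xreg = basket)
fc  <- predict(fit, n.ahead = 7, newxreg = reg_fc)
fc$pred
```

The forecast uncertainty of the regressor is ignored here, so real prediction intervals would be wider than `predict` reports.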

Step 4

In this step, the potential regressors will be used in the model and ARIMAX models will be created.

modelx1=arima(decomposed11$random,order=c(2,0,1),xreg=tayt$basket_count)
print(modelx1)
## 
## Call:
## arima(x = decomposed11$random, order = c(2, 0, 1), xreg = tayt$basket_count)
## 
## Coefficients:
##          ar1      ar2      ma1  intercept  tayt$basket_count
##       0.9270  -0.4999  -0.4183     0.8575                  0
## s.e.  0.0778   0.0390   0.0688     0.0199                NaN
## 
## sigma^2 estimated as 0.07117:  log likelihood = -35.99,  aic = 83.99
fittedx1=fitted(modelx1)
fitted_transformedx1=fittedx1*decomposed11$seasonal*decomposed11$trend

modelx2=arima(decomposed11$random,order=c(2,0,1),xreg=tayt$category_sold)
print(modelx2)
## 
## Call:
## arima(x = decomposed11$random, order = c(2, 0, 1), xreg = tayt$category_sold)
## 
## Coefficients:
##          ar1      ar2     ma1  intercept  tayt$category_sold
##       0.6093  -0.0576  0.2404     0.6511               2e-04
## s.e.  0.1538   0.1151  0.1450     0.0398                 NaN
## 
## sigma^2 estimated as 0.05661:  log likelihood = 5.79,  aic = 0.42
fittedx2=fitted(modelx2)
fitted_transformedx2=fittedx2*decomposed11$seasonal*decomposed11$trend

modelx3=arima(decomposed11$random,order=c(2,0,1),xreg=tayt$category_favored)
print(modelx3)
## 
## Call:
## arima(x = decomposed11$random, order = c(2, 0, 1), xreg = tayt$category_favored)
## 
## Coefficients:
##          ar1      ar2      ma1  intercept  tayt$category_favored
##       1.0271  -0.5787  -0.6007     0.8578                      0
## s.e.  0.0526   0.0405   0.0372     0.0211                    NaN
## 
## sigma^2 estimated as 0.07161:  log likelihood = -37.22,  aic = 86.44
fittedx3=fitted(modelx3)
fitted_transformedx3=fittedx3*decomposed11$seasonal*decomposed11$trend
AIC(model13)
## [1] 99.11418
AIC(modelx1)
## [1] 83.98627
AIC(modelx2)
## [1] 0.4222833
AIC(modelx3)
## [1] 86.43665

For an initial comparison, model arimax2 (category sold as regressor) has by far the lowest AIC. If the test results for the last week do not change the picture, arimax2 looks like the best option. (Note that the NaN standard errors for the regressor coefficients above mean those estimates should be treated with some caution.)

Three models have been constructed, one per candidate regressor. Now they are tested on the last week.

accu(ts11,tail(head(fitted_transformed13,369),7)) # arima model
##     n    mean       sd       CV        MAPE       MAD         MADP        WMAPE
## 1 372 858.207 1099.162 1.280766 0.001727326 0.4430522 0.0005162533 0.0005162533
accu(ts11,tail(head(fitted_transformedx1,369),7)) # arimax1
##     n    mean       sd       CV      MAPE       MAD       MADP      WMAPE
## 1 372 858.207 1099.162 1.280766 0.0019466 0.5126242 0.00059732 0.00059732
accu(ts11,tail(head(fitted_transformedx2,369),7)) # arimax2
##     n    mean       sd       CV        MAPE       MAD         MADP        WMAPE
## 1 372 858.207 1099.162 1.280766 0.002237589 0.5641885 0.0006574038 0.0006574038
accu(ts11,tail(head(fitted_transformedx3,369),7)) # arimax3
##     n    mean       sd       CV        MAPE       MAD         MADP        WMAPE
## 1 372 858.207 1099.162 1.280766 0.001601462 0.4117527 0.0004797825 0.0004797825

Interpreting the test, all four models give very close, high-quality results. Model arimax3 (category favored as regressor) actually gives the lowest WMAPE, MADP, and MAD values, but the differences are negligible, so the choice can be made on AIC instead. Arimax2 (category sold as regressor) has a remarkably low AIC, close to 0 while the others are around 80. So the model is improved to an ARIMAX with category sold as regressor.

Final Step

After all the tests and operations above, the arimax2 model (category sold as regressor) is selected as the final model due to its outstandingly low AIC.

Product 6 (Diş Fırçası)

Step 1

In this step, the seasonality of the product is analysed in order to make an appropriate decomposition.

Seasonality Analysis

The plot below shows the sold count of product-6 for the related period.

As seen above, there is a pattern that fluctuates up on certain days and down on others, which directs us to check the autocorrelation in order to determine the related seasonality.

Table below shows autocorrelation values at each lag until lag 100.

In this case, autocorrelation is high at every lag. Frequencies of 7, 14, and 30 will be examined in the decomposition part and the seasonality selected accordingly. As will appear in the next part, frequency=7 gives the most stationary random component and the lowest ACF values after decomposition, so it would not be wrong to say there is approximately weekly seasonality.
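
The frequency choice described here can be made systematic with a small helper that decomposes the series at each candidate frequency and scores the autocorrelation left in the random component. The helper and the weekly synthetic series below are illustrative, not part of the original analysis.

```r
set.seed(1)
# Synthetic daily sales with a weekly pattern and a trend (stand-in data)
n <- 140
sales <- (100 + 0.5 * seq_len(n)) *
  rep(c(1.2, 1.1, 1.0, 0.9, 0.8, 1.0, 1.0), n / 7) *
  exp(rnorm(n, sd = 0.05))

residual_acf <- function(x, freq) {
  d <- decompose(ts(x, frequency = freq), type = "multiplicative")
  a <- acf(d$random, na.action = na.pass, plot = FALSE)
  mean(abs(a$acf[-1]))   # mean absolute autocorrelation beyond lag 0
}

# Lower score = less structure left after decomposition at that frequency
sapply(c(7, 14, 30), function(f) residual_acf(sales, f))
```

This mirrors the visual comparison done below for frequencies 7, 14, and 30, just condensed into one number per candidate.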

Decomposition

As seen below, the variance increases over time, so the best way to decompose the series is the multiplicative method.

Frequency=7

Required codes are below:

ts11=ts(firca$sold_count, frequency = 7)
decomposed11= decompose(ts11,type="multiplicative")
deseasonalised11=ts11/decomposed11$seasonal
detrend11=deseasonalised11/decomposed11$trend
acf(detrend11,na.action = na.pass, main="Decomposed ACF freq=7")

The ACF of the decomposed series is low and acceptable except for a few lags. As the graph shows, the random (detrended) values are not quite stationary, with decreasing variance and a changing mean, and the trend tends to increase. The model may still be selected because the other options are even worse. Lastly, the seasonal component shows a decent pattern with an acceptable shape.

Frequency=14

Required codes are below:

ts13=ts(firca$sold_count, frequency = 14)
decomposed13= decompose(ts13,type="multiplicative")
deseasonalised13=ts13/decomposed13$seasonal
detrend13=deseasonalised13/decomposed13$trend
acf(detrend13,na.action = na.pass, main="Decomposed ACF freq=14")

plot(decomposed13)

Autocorrelation seems a little higher than with frequency 7, but still better than before decomposition.

The detrended (random) values are even less stationary in this case: there are jumps in the variance, which also decreases, though the mean is more constant than before. The trend is inclined to go up, and the seasonal component looks similar. Overall, this model is worse than the previous one because the random component shows more signs of nonstationarity than with frequency=7.

Frequency=30

Required codes are below:

ts12=ts(firca$sold_count, frequency = 30)
decomposed12= decompose(ts12,type="multiplicative")
deseasonalised12=ts12/decomposed12$seasonal
detrend12=deseasonalised12/decomposed12$trend
acf(detrend12,na.action = na.pass, main="Decomposed ACF freq=30")

plot(decomposed12)

The detrended values are the worst in this model, showing clearly nonstationary behavior due to changing variance and mean. The trend component is again inclined to increase. The ACF of the detrended values is not too bad, but the random component is still worse than the weekly one because of its changing variance and non-constant mean. The seasonal component has more terms per cycle because of the larger frequency.

For the further parts, even though none of the three is particularly good, frequency=7 is acceptable and the best of them. So we continue with frequency = 7 due to its slightly better detrended values (slightly more stationary, with a more constant mean and variance).

Step 2

Based on the findings in the first part, ARIMA models will be proposed and their performance will be commented on.

As understood from the ACF and PACF plots, both tend to decrease. Since the PACF shows no significant values after lag 3, an AR(3) model can be proposed; since the ACF shows no significant values after lag 5, MA(5) is proposed. We can also try auto.arima and compare these three to select the best model. Additionally, it should be mentioned that there is no sign of a seasonal ARIMA here.
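
The cut-off rule applied here (PACF for the AR order, ACF for the MA order) can be verified on a simulated series; the AR coefficients below are arbitrary, chosen only for illustration.

```r
set.seed(3)
# Simulate an AR(2) process: its PACF should cut off after lag 2
sim <- arima.sim(model = list(ar = c(0.6, -0.3)), n = 1000)

p <- pacf(sim, plot = FALSE)$acf
# Lags 1-2 clearly significant; later lags fall inside the ~2/sqrt(n) band
round(p[1:5], 3)
```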

AR(3)

AR(3) is created below:

model11=arima(decomposed11$random,order=c(3,0,0))
print(model11)
## 
## Call:
## arima(x = decomposed11$random, order = c(3, 0, 0))
## 
## Coefficients:
##          ar1      ar2      ar3  intercept
##       0.3530  -0.1543  -0.1970     0.9439
## s.e.  0.0512   0.0538   0.0511     0.0212
## 
## sigma^2 estimated as 0.1637:  log likelihood = -188.37,  aic = 386.75
fitted11=fitted(model11)
fitted_transformed11=fitted11*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed11,type = "l",col="blue")

Fitted values (blue) are really close to the real sales amounts (black). They lag slightly behind because of the AR terms.

The residuals show a constant mean, but the variance changes at some points. There is still some autocorrelation visible in the ACF and PACF plots.

MA(5)

MA(5) is created below:

model12=arima(decomposed11$random,order=c(0,0,5))
print(model12)
## 
## Call:
## arima(x = decomposed11$random, order = c(0, 0, 5))
## 
## Coefficients:
##          ma1      ma2      ma3      ma4      ma5  intercept
##       0.3168  -0.1039  -0.2983  -0.1865  -0.0843     0.9434
## s.e.  0.0525   0.0522   0.0518   0.0482   0.0520     0.0135
## 
## sigma^2 estimated as 0.1589:  log likelihood = -182.91,  aic = 379.82
fitted12=fitted(model12)
fitted_transformed12=fitted12*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed12,type = "l",col="blue")

Fitted values (blue) are really close to real sales amount (black) in this model too.

Residuals look similar to the AR(3) model, with slightly better variance, but the summaries above show a better (lower) AIC for this model, and the ACF and PACF plots indicate that autocorrelation is handled better here.

Auto.Arima

Auto Arima is created below:

model13=auto.arima(decomposed11$random,seasonal = FALSE)
print(model13)
## Series: decomposed11$random 
## ARIMA(2,0,1) with non-zero mean 
## 
## Coefficients:
##          ar1      ar2      ma1    mean
##       1.0468  -0.4621  -0.7305  0.9436
## s.e.  0.0645   0.0462   0.0580  0.0136
## 
## sigma^2 estimated as 0.1598:  log likelihood=-182
## AIC=374.01   AICc=374.17   BIC=393.52
model13=arima(decomposed11$random,order=c(2,0,1)) # Manual version of auto.arima (same model)
fitted13=fitted(model13)
fitted_transformed13=fitted13*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed13,type = "l",col="blue")

Fitted values (blue) are again really close to real sales amount (black).

plot(residuals(model13), xlab="Days", ylab="Values", main="Residuals")

tsdisplay(fitted13,main="ARIMA(2,0,1)")

Residuals are still similar to the previous two models. The ACF and PACF plots tell us this is the best model, with the lowest ACF and PACF values. auto.arima does not always find the best model, but here it worked very well.

Model Comparison

AIC(model11)
## [1] 386.7499
BIC(model11)
## [1] 406.2631
AIC(model12)
## [1] 379.822
BIC(model12)
## [1] 407.1404
AIC(model13)
## [1] 374.0077
BIC(model13)
## [1] 393.5209

As seen above, model13, the auto.arima selection ARIMA(2,0,1), is best in both AIC and BIC. The second-best model is MA(5).

Test for fitted values:

accu=function(actual,forecast){
  n=length(actual)
  error=actual-forecast
  mean=mean(actual)
  sd=sd(actual)
  CV=sd/mean
  MAPE=sum(abs(error/actual))/n
  MAD=sum(abs(error))/n
  MADP=sum(abs(error))/sum(abs(actual))
  WMAPE=MAD/mean
  l=data.frame(n,mean,sd,CV,MAPE,MAD,MADP,WMAPE)
  return(l)
}

accu(ts11,tail(head(fitted_transformed11,369),366))
##     n     mean       sd       CV MAPE      MAD     MADP    WMAPE
## 1 372 92.20968 98.29599 1.066005  Inf 20.02766 0.217197 0.217197
accu(ts11,tail(head(fitted_transformed12,369),366))
##     n     mean       sd       CV MAPE     MAD      MADP     WMAPE
## 1 372 92.20968 98.29599 1.066005  Inf 19.9664 0.2165326 0.2165326
accu(ts11,tail(head(fitted_transformed13,369),366))
##     n     mean       sd       CV MAPE      MAD      MADP     WMAPE
## 1 372 92.20968 98.29599 1.066005  Inf 19.68207 0.2134491 0.2134491

The test results show that all three models are close to each other, but the ARIMA(2,0,1) model is slightly better, with lower errors, so we continue with it. (MAPE comes out infinite here because the series contains days with zero sales, so MAD, MADP, and WMAPE are the meaningful measures in this comparison.)
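
The infinite MAPE values come straight from the definition: the percentage error divides by the actual value, and this series has zero-sales days. A toy example with made-up numbers shows why WMAPE stays usable:

```r
actual <- c(0, 10, 20)   # one zero-sales day
pred   <- c(1,  9, 21)
err    <- actual - pred

MAPE  <- mean(abs(err / actual))           # division by zero -> Inf
WMAPE <- sum(abs(err)) / sum(abs(actual))  # stays finite: 3/30 = 0.1
c(MAPE = MAPE, WMAPE = WMAPE)
```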

Step 3

In this step, possible regressors will be examined for use in an ARIMAX model.

Code below shows correlation between sold count and the data categories.

correlation=function(dt){  # Correlation between sold count (column 4) and the other count columns (5-12)
  dt=as.data.frame(dt)
  a=data.frame(Variable=c("visit count","basket count","favored count",
                          "category sold","category visit","category basket",
                          "category favored","category brand sold"),
               Correlation=NA_real_)
  for(i in 1:8) {
    a[i,2]=cor(dt[,4],dt[,i+4])
  }
  return(a)
}
cor(firca$sold_count,firca$price)
## [1] NA
correlation(firca)
##              Variable Correlation
## 1         visit count   0.8409331
## 2        basket count   0.9515283
## 3       favored count   0.7896215
## 4       category sold   0.3998297
## 5      category visit   0.1229495
## 6     category basket   0.5771166
## 7    category favored   0.4229052
## 8 category brand sold   0.4911847

There is clearly high correlation between sold count and visit count, basket count, and favored count. The price data for this product is dirty (the correlation comes out NA), so the best option is not to use price as a regressor. A potential problem with the remaining regressors is that their future values are unknown, so they would have to be forecast as well in the final version of the model.
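
The NA correlation for price is the default behavior of `stats::cor` when either vector contains missing values. If the price column were only partially dirty, the standard `use = "complete.obs"` argument would compute the correlation over complete pairs instead; a toy illustration:

```r
sold  <- c(5, 8, 6, 9, 7, 10)
price <- c(20, NA, 19, NA, 18, 17)   # partially missing price data

cor(sold, price)                         # NA under the default behavior
cor(sold, price, use = "complete.obs")   # uses only the 4 complete pairs
```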

Step 4

In this step, the potential regressors will be used in the model and ARIMAX models will be created.

modelx1=arima(decomposed11$random,order=c(2,0,1),xreg=firca$visit_count)
print(modelx1)
## 
## Call:
## arima(x = decomposed11$random, order = c(2, 0, 1), xreg = firca$visit_count)
## 
## Coefficients:
##          ar1      ar2      ma1  intercept  firca$visit_count
##       1.0622  -0.4665  -0.7616     0.9207                  0
## s.e.  0.0279   0.0344      NaN        NaN                NaN
## 
## sigma^2 estimated as 0.1545:  log likelihood = -177.89,  aic = 367.79
fittedx1=fitted(modelx1)
fitted_transformedx1=fittedx1*decomposed11$seasonal*decomposed11$trend

modelx2=arima(decomposed11$random,order=c(2,0,1),xreg=firca$basket_count)
print(modelx2)
## 
## Call:
## arima(x = decomposed11$random, order = c(2, 0, 1), xreg = firca$basket_count)
## 
## Coefficients:
##          ar1      ar2      ma1  intercept  firca$basket_count
##       1.0356  -0.4542  -0.7256     0.8750               2e-04
## s.e.  0.0703   0.0485   0.0756     0.0232               1e-04
## 
## sigma^2 estimated as 0.1478:  log likelihood = -169.67,  aic = 351.34
fittedx2=fitted(modelx2)
fitted_transformedx2=fittedx2*decomposed11$seasonal*decomposed11$trend

modelx3=arima(decomposed11$random,order=c(2,0,1),xreg=firca$favored_count)
print(modelx3)
## 
## Call:
## arima(x = decomposed11$random, order = c(2, 0, 1), xreg = firca$favored_count)
## 
## Coefficients:
##          ar1      ar2      ma1  intercept  firca$favored_count
##       1.0578  -0.4650  -0.7599     0.9097                1e-04
## s.e.  0.0615   0.0464   0.0542     0.0158                1e-04
## 
## sigma^2 estimated as 0.1531:  log likelihood = -176.22,  aic = 364.44
fittedx3=fitted(modelx3)
fitted_transformedx3=fittedx3*decomposed11$seasonal*decomposed11$trend
AIC(model13)
## [1] 374.0077
AIC(modelx1)
## [1] 367.7873
AIC(modelx2)
## [1] 351.3435
AIC(modelx3)
## [1] 364.4441

For an initial comparison, model arimax2 (basket count as regressor, which also has the highest correlation, 0.95) has the lowest AIC. If the test results for the last week do not change the picture, arimax2 looks like the best option.

Three models have been constructed, one per candidate regressor. Now they are tested on the last week.

accu(ts11,tail(head(fitted_transformed13,369),7)) # arima model
##     n     mean       sd       CV        MAPE      MAD        MADP       WMAPE
## 1 372 92.20968 98.29599 1.066005 0.004081113 0.725805 0.007871245 0.007871245
accu(ts11,tail(head(fitted_transformedx1,369),7)) # arimax1
##     n     mean       sd       CV        MAPE       MAD        MADP       WMAPE
## 1 372 92.20968 98.29599 1.066005 0.004215994 0.6889001 0.007471018 0.007471018
accu(ts11,tail(head(fitted_transformedx2,369),7)) # arimax2
##     n     mean       sd       CV        MAPE       MAD        MADP       WMAPE
## 1 372 92.20968 98.29599 1.066005 0.004182911 0.6607526 0.007165762 0.007165762
accu(ts11,tail(head(fitted_transformedx3,369),7)) # arimax3
##     n     mean       sd       CV        MAPE       MAD        MADP       WMAPE
## 1 372 92.20968 98.29599 1.066005 0.004102785 0.6787281 0.007360704 0.007360704

Interpreting the test, model arimax2 (basket count as regressor) gives the lowest WMAPE, MADP, and MAD values, so the model is enhanced with this regressor.

Final Step

After all the tests and operations above, the arimax2 model (basket count as regressor) is selected as the final model.

Product 7 (Mont)

ts.plot(mont$sold_count)

It is clear that the number of sales for this product is 0 most of the time in our data. Considering that this product is mainly used during a small part of the year, it makes sense that sales above 0 appear only on dates near the winter months. Next, we plot the basket_count data for this product.

ts.plot(mont$basket_count)

Again, there are a lot of 0 values in this data.

acf(mont$sold_count, lag=100 ,main="Mont")

We will take a closer look at this data by decreasing the lag parameter.

acf(mont$sold_count, lag=30 ,main="Mont")

We see that the peaks do not recur at a fixed frequency (there is no obvious pattern), and there is no remarkable autocorrelation at the important lags such as 7, 15, and 30. Therefore, no conclusions about seasonality can be drawn.

ARIMA models are built on time series data and make predictions using previous values, errors, etc. However, our data for the Mont product does not contain a sufficient number of nonzero sold_count values. That is why we cannot build and test ARIMA models for this product, and unfortunately the later steps will not be taken.
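
The degree of intermittency that blocks the ARIMA pipeline can be quantified directly. The sketch below uses a synthetic coat-like series; the 85% zero rate is an illustrative assumption, not a value measured from the Mont data.

```r
set.seed(11)
# Mostly-zero daily sales with occasional demand spikes (synthetic)
sold <- ifelse(runif(365) < 0.85, 0, rpois(365, lambda = 4))

zero_share <- mean(sold == 0)
zero_share
# A series dominated by zeros gives ARIMA little signal to model,
# and MAPE-style accuracy measures become infinite
```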


Step 0

In this step, required libraries are called, necessary data tables are created and the other very beginning operations are executed.

library(data.table)
library(ggplot2)
library(dplyr)
require(mgcv)
library(fpp)
require(gratia)
library(readxl)
library(lubridate)
library(forecast)
library(zoo)
library(stringr)
library(stats)
library(urca)

g13<-read_excel("g13.xlsx") 
dat<-as.data.table(g13)   # Main Data Table

yuzt<-dat[1:372,]        # Data table which holds the data of Product-1 (Yüz Temizleyici)
mendil<-dat[373:744,]    # Data table which holds the data of Product-2 (Bebek Islak Mendili)
xiaomi<-dat[745:1116,]   # Data table which holds the data of Product-3 (Xiaomi Bluetooth Kulaklık)
fakir<-dat[1117:1488,]   # Data table which holds the data of Product-4 (Fakir Süpürge)
tayt<-dat[1489:1860,]    # Data table which holds the data of Product-5 (Tayt)
bik1<-dat[1861:2232,]    # Data table which holds the data of Product-8 (Bikini Model-1)
bik_1<-dat[2132:2232,]   # Data table without the null values of Product-8
firca<-dat[2233:2604,]   # Data table which holds the data of Product-6 (Diş Fırçası)
mont<-dat[2605:2976,]    # Data table which holds the data of Product-7 (Mont)
bik2<-dat[2977:3348,]    # Data table which holds the data of Product-9 (Bikini Model-2)

Product 8 (Bikini Model-1)

Step 1

In this step, the seasonality of the product is analyzed in order to make an appropriate decomposition.

Seasonality Analysis

The plot below shows the sold count of product-8 for the related period.

As seen above, there is a pattern that fluctuates up on certain days and down on others, and the variance does not seem constant. This directs us to check the autocorrelation in order to determine the related seasonality.

Table below shows autocorrelation values at each lag until lag 100.

It would not be inappropriate to say that the autocorrelation function also shows a pattern: around lags 10 and 44 the autocorrelation rises, then fades away at later lags. Normally, the time series of sold count for product-8 ought to be given a frequency of 10, which would cover the other autocorrelations too. However, as will be observed in the next step, choosing 7 gives better decomposition results than 10, covering the other autocorrelated lags with a more stationary series. So it would not be wrong to say there is weekly seasonality, even if the autocorrelation function does not show it directly.

Decomposition

As seen below, the variance first decreases and then increases, so the best way to decompose the series is the multiplicative method.
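
The multiplicative choice can be sanity-checked numerically: decompose a series whose seasonal swings grow with the level both ways, and compare the reconstruction errors. The series below is synthetic and the comparison is only a rough heuristic, not part of the original analysis.

```r
set.seed(5)
n <- 140
level <- 50 + seq_len(n)   # rising level makes the seasonal swings grow
sales <- level * rep(c(1.4, 1.0, 0.8, 0.9, 1.1, 1.0, 0.8), n / 7) *
  exp(rnorm(n, sd = 0.03))

ts7  <- ts(sales, frequency = 7)
add  <- decompose(ts7, type = "additive")
mult <- decompose(ts7, type = "multiplicative")

# Mean absolute reconstruction error of each decomposition
c(additive       = mean(abs(ts7 - (add$trend + add$seasonal)), na.rm = TRUE),
  multiplicative = mean(abs(ts7 - mult$trend * mult$seasonal), na.rm = TRUE))
```

For level-dependent seasonality like this, the multiplicative reconstruction leaves clearly smaller residuals.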

Frequency=7

Required codes are below:

ts11=ts(bik_1$sold_count, frequency = 7)
decomposed11= decompose(ts11,type="multiplicative")
deseasonalised11=ts11/decomposed11$seasonal
detrend11=deseasonalised11/decomposed11$trend
acf(detrend11,na.action = na.pass, main="Decomposed ACF freq=7")

The ACF values of the decomposed series stay below the significance levels. Moreover, as the graph shows, the random (detrended) values are quite stationary, with stable variance and a nearly constant mean. The trend increases with fluctuations in the later parts of the series, which may introduce some forecast error, but the model is still acceptable and good. Lastly, the seasonal component shows a decent pattern with an acceptable shape.

Frequency=15

Required codes are below:

ts12=ts(bik_1$sold_count, frequency = 15)
decomposed12= decompose(ts12,type="multiplicative")
deseasonalised12=ts12/decomposed12$seasonal
detrend12=deseasonalised12/decomposed12$trend
acf(detrend12,na.action = na.pass, main="Decomposed ACF freq=15")

The higher ACF values at frequency 15 might be explained by the fact that in Turkey most people are paid on either day 1 or day 15 of the month. (Source) People probably tend to shop more after getting their salary; even if they generally use credit cards, fresh money gives them the confidence to spend more. For this product, however, that frequency did not give the ACF values we want: the ACF of the detrended series is worse, and the plot of the detrended values below is worse than the one at frequency=7.

The detrended (random) values are less stationary in this case: their variance is not constant and their mean is less constant than before. The trend is less fluctuating but moves up and down through time, and the seasonal component looks similar. Overall, this model is worse than the previous one because the detrended random values show more nonstationarity.

Frequency=30

Required codes are below:

ts13=ts(bik_1$sold_count, frequency = 30)
decomposed13= decompose(ts13,type="multiplicative")
deseasonalised13=ts13/decomposed13$seasonal
detrend13=deseasonalised13/decomposed13$trend
acf(detrend13,na.action = na.pass, main="Decomposed ACF freq=30")

The detrended values are the worst in this model, showing nonstationary behavior due to changing variance and a non-constant mean. The trend component increases more than in the other two and fluctuates less. The first two ACF values of the detrended series are high, and the random component is also the worst of the three models. The seasonal component has a larger range, which affects the results more.

For the further parts, we will continue with the model of frequency = 7.

Step 2

Based on the findings in the first part, ARIMA models will be proposed and their performance will be commented on.

As understood from the ACF and PACF plots, both show high correlations in places. Since the PACF shows no significant values after lag 3, an AR(3) model can be proposed; since the ACF shows no significant values after lag 2, MA(2) is proposed. We can also try auto.arima and compare these three to select the best model; the third model, found by auto.arima, will probably be the best one to build on. Additionally, it should be mentioned that there is no sign of a seasonal ARIMA here.

AR(3)

AR(3) is created below:

model11=arima(decomposed11$random,order=c(3,0,0))
print(model11)
## 
## Call:
## arima(x = decomposed11$random, order = c(3, 0, 0))
## 
## Coefficients:
##          ar1      ar2      ar3  intercept
##       0.2455  -0.3144  -0.1884     0.9900
## s.e.  0.1012   0.0982   0.1003     0.0232
## 
## sigma^2 estimated as 0.07927:  log likelihood = -14.62,  aic = 39.24
fitted11=fitted(model11)
fitted_transformed11=fitted11*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed11,type = "l",col="blue")

Fitted values (blue) are close to the real sales amounts (black), though this simple model misses a fair share of the data points. The fit lags slightly because of the nature of AR models.

The residuals show a constant mean, but the variance differs at times. The ACF and PACF plots still show some autocorrelation above the significance levels.

MA(2)

MA(2) is created below:

model12=arima(decomposed11$random,order=c(0,0,2))
print(model12)
## 
## Call:
## arima(x = decomposed11$random, order = c(0, 0, 2))
## 
## Coefficients:
##           ma1      ma2  intercept
##       -0.2064  -0.6247     0.9907
## s.e.   0.0739   0.0703     0.0055
## 
## sigma^2 estimated as 0.08212:  log likelihood = -16.74,  aic = 41.49
fitted12=fitted(model12)
fitted_transformed12=fitted12*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed12,type = "l",col="blue")

Fitted values (blue) are really close to the real sales amounts (black) in this model too. The fit is lagged due to the nature of MA models.

The residuals show a constant mean, but the variance is higher at first and lower in the later parts of this model. Even though the ACF and PACF values are better here, they are still high.

Auto.Arima

Auto Arima is created below:

model13=auto.arima(decomposed11$random,seasonal=FALSE)
print(model13)
## Series: decomposed11$random 
## ARIMA(2,0,1) with non-zero mean 
## 
## Coefficients:
##          ar1      ar2      ma1    mean
##       0.9430  -0.4970  -0.9058  0.9906
## s.e.  0.0924   0.0895   0.0542  0.0051
## 
## sigma^2 estimated as 0.06996:  log likelihood=-7.22
## AIC=24.44   AICc=25.11   BIC=37.21
model13=arima(decomposed11$random,order=c(2,0,1)) # Manual version of auto.arima (same model)
fitted13=fitted(model13)
fitted_transformed13=fitted13*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed13,type = "l",col="blue")

Fitted values (blue) are again really close to the real sales amounts (black). The model fits best where the variance is low; outlier points reduce the quality of the fit.

plot(residuals(model13), xlab="Days", ylab="Values", main="Residuals")

tsdisplay(fitted13,main="ARIMA(2,0,1)")

The residuals show a constant mean, and although the variance still differs at times, it behaves better than in the two previous models. The ACF and PACF plots tell us this is the best model, with the lowest ACF and PACF values.

Model Comparison

AIC(model11)
## [1] 39.24273
BIC(model11)
## [1] 52.01212
AIC(model12)
## [1] 41.48683
BIC(model12)
## [1] 51.70234
AIC(model13)
## [1] 24.44036
BIC(model13)
## [1] 37.20975

As seen above, model13, the auto.arima selection ARIMA(2,0,1), is the best model, with the lowest values. ARIMA(2,0,1) should be selected for the further steps.

Test for fitted values:

accu=function(actual,forecast){
  n=length(actual)
  error=actual-forecast
  mean=mean(actual)
  sd=sd(actual)
  CV=sd/mean
  MAPE=sum(abs(error/actual))/n
  MAD=sum(abs(error))/n
  MADP=sum(abs(error))/sum(abs(actual))
  WMAPE=MAD/mean
  l=data.frame(n,mean,sd,CV,MAPE,MAD,MADP,WMAPE)
  return(l)
}

accu(ts11,tail(head(fitted_transformed11,98),95))
##     n     mean       sd        CV      MAPE      MAD      MADP     WMAPE
## 1 101 32.80198 15.34472 0.4677986 0.2446731 6.699741 0.2042481 0.2042481
accu(ts11,tail(head(fitted_transformed12,98),95))
##     n     mean       sd        CV      MAPE      MAD      MADP     WMAPE
## 1 101 32.80198 15.34472 0.4677986 0.2386745 6.370985 0.1942256 0.1942256
accu(ts11,tail(head(fitted_transformed13,98),95))
##     n     mean       sd        CV      MAPE      MAD      MADP     WMAPE
## 1 101 32.80198 15.34472 0.4677986 0.2253659 6.077611 0.1852818 0.1852818

Again, the third result, ARIMA(2,0,1), gives the best performance with the lowest MAPE, MAD, MADP and WMAPE errors. The remaining statistics (n, mean, sd, CV) are equal across all three models because they depend only on the actual data, which is the same in every comparison.

Step 3

In this step, possible regressors will be examined to be used on arimax model.

Code below shows correlation between sold count and the data categories.

correlation=function(dt){  # Function to compute correlation between sold counts and other columns in the data
  dt=as.data.frame(dt)
  a=data.frame(data= NA, nrow=8)
  colnames(a)=c("Variable","Correlation")
  a[1,1]="visit count"
  a[2,1]="basket count"
  a[3,1]="favored count"
  a[4,1]="category sold"
  a[5,1]="category visit"
  a[6,1]="category basket"
  a[7,1]="category favored"
  a[8,1]="category brand sold"
  for(i in 1:8) {
    a[i,2]=cor(dt[,4],dt[,i+4])
  }
  return (a)
}
correlation(bik_1)
##              Variable Correlation
## 1         visit count   0.7099228
## 2        basket count   0.7960412
## 3       favored count   0.4806033
## 4       category sold   0.6833365
## 5      category visit   0.2349762
## 6     category basket   0.4955038
## 7    category favored   0.4896690
## 8 category brand sold   0.5047960
cor(bik_1$sold_count,bik_1$price)
## [1] -0.05910234

Clearly, there is high correlation between sold count and basket count, visit count and category sold. This result makes sense: people tend to buy what they visit more, and when a category attracts attention, sales of all its products increase as well. Thanks to the function created above, possible regressors are selected easily. However, a potential problem with these regressors is that their future values are unknown, so they themselves must be forecast in the final version of the model.
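One way to handle the unknown future regressor values is to first fit a model to the regressor alone and forecast it ahead. The sketch below uses a synthetic stand-in series (in the report the input would be bik_1$basket_count, and the ARIMA order chosen here is an arbitrary assumption):

```r
set.seed(1)
basket=ts(50+cumsum(rnorm(100)))             # stand-in for the basket-count series
reg_model=arima(basket,order=c(1,1,0))       # simple model for the regressor alone
reg_future=predict(reg_model,n.ahead=7)$pred # 7-day-ahead regressor path
length(reg_future)                           # 7 values to supply as newxreg later
```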

Step 4

In this step, potential regressors will be used in the model and arimax models will be created.

modelx1=arima(decomposed11$random,order=c(2,0,1),xreg=bik_1$basket_count)
print(modelx1)
## 
## Call:
## arima(x = decomposed11$random, order = c(2, 0, 1), xreg = bik_1$basket_count)
## 
## Coefficients:
##         ar1      ar2      ma1  intercept  bik_1$basket_count
##       0.771  -0.4691  -0.6327     0.7984               0.001
## s.e.    NaN   0.0375      NaN        NaN                 NaN
## 
## sigma^2 estimated as 0.06242:  log likelihood = -3.35,  aic = 18.69
fittedx1=fitted(modelx1)
fitted_transformedx1=fittedx1*decomposed11$seasonal*decomposed11$trend

modelx2=arima(decomposed11$random,order=c(2,0,1),xreg=bik_1$visit_count)
print(modelx2)
## 
## Call:
## arima(x = decomposed11$random, order = c(2, 0, 1), xreg = bik_1$visit_count)
## 
## Coefficients:
##          ar1     ar2      ma1  intercept  bik_1$visit_count
##       0.8744  -0.505  -0.8680     0.9139                  0
## s.e.  0.0942   0.089   0.0636     0.0005                NaN
## 
## sigma^2 estimated as 0.06188:  log likelihood = -3.36,  aic = 18.71
fittedx2=fitted(modelx2)
fitted_transformedx2=fittedx2*decomposed11$seasonal*decomposed11$trend


modelx3=arima(decomposed11$random,order=c(2,0,1),xreg=bik_1$category_sold)
print(modelx3)
## 
## Call:
## arima(x = decomposed11$random, order = c(2, 0, 1), xreg = bik_1$category_sold)
## 
## Coefficients:
##          ar1      ar2      ma1  intercept  bik_1$category_sold
##       0.9439  -0.4990  -1.0000     0.9718                    0
## s.e.  0.0880   0.0875   0.0332     0.0001                  NaN
## 
## sigma^2 estimated as 0.06025:  log likelihood = -3.59,  aic = 19.18
fittedx3=fitted(modelx3)
fitted_transformedx3=fittedx3*decomposed11$seasonal*decomposed11$trend

Three models are constructed, each with one of the three regressors. Now they will be tested on the last week.

accu(ts11,tail(head(fitted_transformed13,98),7)) # arima model
##     n     mean       sd        CV        MAPE      MAD        MADP       WMAPE
## 1 101 32.80198 15.34472 0.4677986 0.007614923 0.305363 0.009309287 0.009309287
accu(ts11,tail(head(fitted_transformedx1,98),7)) # arimax1
##     n     mean       sd        CV        MAPE       MAD        MADP       WMAPE
## 1 101 32.80198 15.34472 0.4677986 0.007010133 0.2776748 0.008465184 0.008465184
accu(ts11,tail(head(fitted_transformedx2,98),7)) # arimax2
##     n     mean       sd        CV        MAPE      MAD        MADP       WMAPE
## 1 101 32.80198 15.34472 0.4677986 0.007210265 0.277234 0.008451745 0.008451745
accu(ts11,tail(head(fitted_transformedx3,98),7)) # arimax3
##     n     mean       sd        CV       MAPE       MAD        MADP       WMAPE
## 1 101 32.80198 15.34472 0.4677986 0.00780154 0.2981519 0.009089447 0.009089447

As seen in the results, all models perform similarly, but arimax1 is slightly better in terms of WMAPE and the other error measures. Moreover, arimax1 (with basket count as regressor) outperforms the plain ARIMA model from the previous step.

Final Step

After all the tests and operations above, the arimax1 model gives the best results and is selected as the final model.

Product 9 (Bikini Model-2)

Step 1

In this step, seasonality of each product is analyzed to make appropriate decomposition.

Seasonality Analysis

The plot below shows the sold count of product-9 for related period.

As seen above, there is a pattern that fluctuates up and down, and the variance seems to increase through time. This suggests checking the autocorrelation in order to determine the relevant seasonality.

Table below shows autocorrelation values at each lag until lag 100.

The first lags decrease over time, which reflects the increasing trend of the data. Since much of the data has been removed, there are only two significant ACF values, at lag 1 and lag 2. We cannot speak with certainty because of the noisy nature of the data, but it would not be incorrect to say that the autocorrelation increases slightly at lag 7, which suggests weekly seasonality.
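The lag-7 effect described above is exactly what the ACF is designed to expose. A minimal sketch on a toy series with an artificial weekly cycle (not the Trendyol data) illustrates it:

```r
set.seed(7)
x=ts(20+5*sin(2*pi*(1:98)/7)+rnorm(98))  # toy series with a built-in weekly cycle
a=acf(x,lag.max=21,plot=FALSE)
round(a$acf[8],2)                        # lag 7 (a$acf[1] is lag 0): clearly high
```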

Decomposition

As seen below, the variance first decreases and then increases, so the best way to decompose the series is the multiplicative method.

Frequency=7

Required codes are below:

ts11=ts(bik_2$sold_count, frequency = 7)
decomposed11= decompose(ts11,type="multiplicative")
deseasonalised11=ts11/decomposed11$seasonal
detrend11=deseasonalised11/decomposed11$trend
acf(detrend11,na.action = na.pass, main="Decomposed ACF freq=7")

The ACF values of the decomposed series stay below the significance levels. Moreover, as seen in the graph, the random (detrended) values are quite stationary, with stable variance and a nearly constant mean. The trend increases with fluctuations, which may introduce some forecast error, but the model is still acceptable. Lastly, the seasonal component shows a decent, regular pattern.

Frequency=15

Required codes are below:

ts12=ts(bik_2$sold_count, frequency = 15)
decomposed12= decompose(ts12,type="multiplicative")
deseasonalised12=ts12/decomposed12$seasonal
detrend12=deseasonalised12/decomposed12$trend
acf(detrend12,na.action = na.pass, main="Decomposed ACF freq=15")

The reason why the ACF values are higher at frequency 15 might be that in Turkey most people receive their salary on either day 1 or day 15 of the month. (Source) At lag 1.0, which corresponds to the 15th day of the time series, this gives a significant autocorrelation. People probably tend to shop more after getting their salary; even if they generally use credit cards, receiving fresh money gives them the confidence to spend more. However, as mentioned above, the ACF of the detrended series is now worse, and the graph below shows that the detrended values are worse than those at frequency 7.
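The salary-day hypothesis could be probed with a simple payday dummy. The sketch below only constructs the dummy (the day indexing and the 30-day cycle are assumptions); correlating it with bik_2$sold_count would quantify the effect:

```r
days=1:105                                            # assumed day index of the series
payday=as.numeric((days-1)%%30==0 | (days-1)%%30==14) # rough day-1 / day-15 marker
sum(payday)                                           # 8 flagged paydays in 105 days
# cor(bik_2$sold_count, payday[seq_along(bik_2$sold_count)]) would test the idea
```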

Detrended values (random) are less stationary in this case. They do not have constant variance but their mean isconstant compared to previous one. Trend in this series are not so fluctuating but it moves up an down through time. Lastly, seasonal component seems similar. However, this model is worse compared to previous one because detrended random values shows more nonstationarity.

For the further parts, we will continue with the model of frequency = 7.

Step 2

Based on the findings in the first part, ARIMA models will be proposed and their performance will be discussed.

There is no clear sign of which model to use, so AR(2), MA(1) and ARIMA(2,0,1) will be examined.
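The candidate orders can also be cross-checked empirically: on a synthetic ARMA(2,1) series (a sketch, not the decomposed sales data), fitting the same three candidates and comparing AIC shows how the selection works:

```r
set.seed(4)
y=arima.sim(list(ar=c(0.9,-0.5),ma=-0.9),n=150)  # toy series with ARMA(2,1) structure
fits=list(ar2 =arima(y,order=c(2,0,0)),
          ma1 =arima(y,order=c(0,0,1)),
          arma=arima(y,order=c(2,0,1)))
sapply(fits,AIC)  # the richer ARMA(2,1) candidate should score well here
```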

AR(2)

AR(2) is created below:

model11=arima(decomposed11$random,order=c(2,0,0))
print(model11)
## 
## Call:
## arima(x = decomposed11$random, order = c(2, 0, 0))
## 
## Coefficients:
##          ar1      ar2  intercept
##       0.3203  -0.3775     0.9892
## s.e.  0.0948   0.0940     0.0280
## 
## sigma^2 estimated as 0.08229:  log likelihood = -16.35,  aic = 40.7
fitted11=fitted(model11)
fitted_transformed11=fitted11*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed11,type = "l",col="blue")

Fitted values (blue) are close to the real sales amounts (black). However, this is a simple model and it fails to capture a substantial share of the data points. The fit is slightly lagged because of the nature of AR models.

Residuals show a constant mean, though the variance varies at times. The ACF and PACF plots still show some autocorrelation above the significance levels at lag 7.
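Residual diagnostics of this kind can be reproduced directly; a self-contained sketch on a toy AR(2) fit (in the report the input would be model11, and the AR coefficients below only loosely mimic its estimates):

```r
set.seed(2)
y=arima.sim(list(ar=c(0.32,-0.38)),n=120)  # toy AR(2) data
m=arima(y,order=c(2,0,0))
r=residuals(m)
Box.test(r,lag=7,type="Ljung-Box")         # large p-value = no leftover autocorrelation
# tsdisplay(r) from the forecast package draws the residual ACF/PACF discussed here
```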

MA(1)

MA(1) is created below:

model12=arima(decomposed11$random,order=c(0,0,1))
print(model12)
## 
## Call:
## arima(x = decomposed11$random, order = c(0, 0, 1))
## 
## Coefficients:
##          ma1  intercept
##       0.3650     0.9880
## s.e.  0.0936     0.0423
## 
## sigma^2 estimated as 0.09173:  log likelihood = -21.4,  aic = 48.79
fitted12=fitted(model12)
fitted_transformed12=fitted12*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed12,type = "l",col="blue")

Fitted values (blue) are really close to the real sales amounts (black) in this model too. The fit is lagged due to the nature of MA models.

Residuals show a fluctuating mean, and the variance does not look good in this model either. However, the ACF and PACF values are better here.

ARIMA(2,0,1)

Arima(2,0,1) is created below:

model13=arima(decomposed11$random,order =c(2,0,1))
print(model13)
## 
## Call:
## arima(x = decomposed11$random, order = c(2, 0, 1))
## 
## Coefficients:
##          ar1      ar2      ma1  intercept
##       0.9430  -0.4970  -0.9058     0.9906
## s.e.  0.0924   0.0895   0.0542     0.0051
## 
## sigma^2 estimated as 0.06702:  log likelihood = -7.22,  aic = 24.44
fitted13=fitted(model13)
fitted_transformed13=fitted13*decomposed11$seasonal*decomposed11$trend
plot(ts11,main="fitted vs real",ylab="Sold Count", xlab="Days")
points(fitted_transformed13,type = "l",col="blue")

Fitted values (blue) are again really close to the real sales amounts (black). The model fits better when the variance is low; outlier points lower the overall quality of the fit.

plot(residuals(model13), xlab="Days", ylab="Values", main="Residuals")

tsdisplay(fitted13,main="ARIMA(2,0,1)")

Residuals show constant mean but variance is too high and fluctuating.

Model Comparison

AIC(model11)
## [1] 40.69692
BIC(model11)
## [1] 50.91243
AIC(model12)
## [1] 48.79104
BIC(model12)
## [1] 56.45267
AIC(model13)
## [1] 24.44036
BIC(model13)
## [1] 37.20975

As seen above, model13, ARIMA(2,0,1), is the best model with the lowest AIC and BIC values. ARIMA(2,0,1) should be selected for the further steps.

Test for fitted values:

accu=function(actual,forecast){
  n=length(actual)
  error=actual-forecast
  mean=mean(actual)
  sd=sd(actual)
  CV=sd/mean
  MAPE=sum(abs(error/actual))/n
  MAD=sum(abs(error))/n
  MADP=sum(abs(error))/sum(abs(actual))
  WMAPE=MAD/mean
  l=data.frame(n,mean,sd,CV,MAPE,MAD,MADP,WMAPE)
  return(l)
}

accu(ts11,tail(head(fitted_transformed11,31),28))
##     n     mean       sd        CV      MAPE      MAD       MADP      WMAPE
## 1 101 32.80198 15.34472 0.4677986 0.1186466 2.884616 0.08794029 0.08794029
accu(ts11,tail(head(fitted_transformed12,31),28))
##     n     mean       sd        CV      MAPE     MAD       MADP      WMAPE
## 1 101 32.80198 15.34472 0.4677986 0.1247375 2.98147 0.09089297 0.09089297
accu(ts11,tail(head(fitted_transformed13,31),28))
##     n     mean       sd        CV      MAPE      MAD       MADP      WMAPE
## 1 101 32.80198 15.34472 0.4677986 0.1132647 2.620198 0.07987925 0.07987925

Again, the third result, ARIMA(2,0,1), gives the best performance with the lowest MAPE, MAD, MADP and WMAPE errors.

Step 3

In this step, possible regressors will be examined to be used on arimax model.

Code below shows correlation between sold count and the data categories.

correlation=function(dt){  # Function to compute correlation between sold counts and other columns in the data
  dt=as.data.frame(dt)
  a=data.frame(data= NA, nrow=8)
  colnames(a)=c("Variable","Correlation")
  a[1,1]="visit count"
  a[2,1]="basket count"
  a[3,1]="favored count"
  a[4,1]="category sold"
  a[5,1]="category visit"
  a[6,1]="category basket"
  a[7,1]="category favored"
  a[8,1]="category brand sold"
  for(i in 1:8) {
    a[i,2]=cor(dt[,4],dt[,i+4])
  }
  return (a)
}
correlation(bik_2)
##              Variable Correlation
## 1         visit count   0.7099228
## 2        basket count   0.7960412
## 3       favored count   0.4806033
## 4       category sold   0.6833365
## 5      category visit   0.2349762
## 6     category basket   0.4955038
## 7    category favored   0.4896690
## 8 category brand sold   0.5047960
cor(bik_2$sold_count,bik_2$price)
## [1] -0.05910234

Clearly, there is high correlation between sold count and basket count, visit count and favored count. This result makes sense: people tend to buy what they visit more, and when a category attracts attention, sales of all its products increase as well. Thanks to the function created above, possible regressors are selected easily. However, a potential problem with these regressors is that their future values are unknown, so they themselves must be forecast in the final version of the model.

Step 4

In this step, potential regressors will be used in the model and arimax models will be created.

modelx1=arima(decomposed11$random,order=c(2,0,1),xreg=bik_2$basket_count)
print(modelx1)
## 
## Call:
## arima(x = decomposed11$random, order = c(2, 0, 1), xreg = bik_2$basket_count)
## 
## Coefficients:
##         ar1      ar2      ma1  intercept  bik_2$basket_count
##       0.771  -0.4691  -0.6327     0.7984               0.001
## s.e.    NaN   0.0375      NaN        NaN                 NaN
## 
## sigma^2 estimated as 0.06242:  log likelihood = -3.35,  aic = 18.69
fittedx1=fitted(modelx1)
fitted_transformedx1=fittedx1*decomposed11$seasonal*decomposed11$trend

modelx2=arima(decomposed11$random,order=c(2,0,1),xreg=bik_2$visit_count)
print(modelx2)
## 
## Call:
## arima(x = decomposed11$random, order = c(2, 0, 1), xreg = bik_2$visit_count)
## 
## Coefficients:
##          ar1     ar2      ma1  intercept  bik_2$visit_count
##       0.8744  -0.505  -0.8680     0.9139                  0
## s.e.  0.0942   0.089   0.0636     0.0005                NaN
## 
## sigma^2 estimated as 0.06188:  log likelihood = -3.36,  aic = 18.71
fittedx2=fitted(modelx2)
fitted_transformedx2=fittedx2*decomposed11$seasonal*decomposed11$trend


modelx3=arima(decomposed11$random,order=c(2,0,1),xreg=bik_2$favored_count)
print(modelx3)
## 
## Call:
## arima(x = decomposed11$random, order = c(2, 0, 1), xreg = bik_2$favored_count)
## 
## Coefficients:
##          ar1      ar2      ma1  intercept  bik_2$favored_count
##       0.8885  -0.5054  -0.8556     0.9579                1e-04
## s.e.  0.0922   0.0886   0.0557        NaN                  NaN
## 
## sigma^2 estimated as 0.06526:  log likelihood = -5.83,  aic = 23.65
fittedx3=fitted(modelx3)
fitted_transformedx3=fittedx3*decomposed11$seasonal*decomposed11$trend

Three models are constructed, each with one of the three regressors. Now they will be tested on the last week.

accu(ts11,tail(head(fitted_transformed13,31),7)) # arima model
##     n     mean       sd        CV       MAPE       MAD     MADP    WMAPE
## 1 101 32.80198 15.34472 0.4677986 0.04803259 0.6914002 0.021078 0.021078
accu(ts11,tail(head(fitted_transformedx1,31),7)) # arimax1
##     n     mean       sd        CV       MAPE       MAD       MADP      WMAPE
## 1 101 32.80198 15.34472 0.4677986 0.03311664 0.5477929 0.01669999 0.01669999
accu(ts11,tail(head(fitted_transformedx2,31),7)) # arimax2
##     n     mean       sd        CV       MAPE       MAD       MADP      WMAPE
## 1 101 32.80198 15.34472 0.4677986 0.04203058 0.6142789 0.01872689 0.01872689
accu(ts11,tail(head(fitted_transformedx3,31),7)) # arimax3
##     n     mean       sd        CV       MAPE       MAD       MADP      WMAPE
## 1 101 32.80198 15.34472 0.4677986 0.04493964 0.6470274 0.01972526 0.01972526
AIC(model13)
## [1] 24.44036
AIC(modelx1)
## [1] 18.69395
AIC(modelx2)
## [1] 18.71035
AIC(modelx3)
## [1] 23.65499

As seen in the results, all models perform similarly, but arimax1 is slightly better in terms of WMAPE and the other error measures. Moreover, arimax1 (with basket count as regressor) outperforms the plain ARIMA model from the previous step.

Final Step

After all the tests and operations above, the arimax1 model gives the best results and is selected as the final model.
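To actually produce forecasts with the selected ARIMAX, the random component must be predicted with future regressor values and then recombined with the seasonal and trend factors. The sketch below runs on synthetic stand-ins (in the report, random would be decomposed11$random and the regressor bik_2$basket_count; the regressor's own ARIMA order is an arbitrary assumption):

```r
set.seed(3)
random=ts(1+0.1*arima.sim(list(ar=0.5),n=98),frequency=7)  # stand-in multiplicative random
basket=50+cumsum(rnorm(98))                                # stand-in regressor series
m=arima(random,order=c(2,0,1),xreg=basket)                 # same order as arimax1
basket_future=predict(arima(basket,order=c(1,1,0)),n.ahead=7)$pred
fc=predict(m,n.ahead=7,newxreg=basket_future)$pred         # 7-day random forecast
# fc would then be multiplied by the matching seasonal and (extrapolated) trend values
length(fc)
```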